TL;DR — cgroups v2 offers a unified hierarchy, richer controllers, and easier accounting. By switching systemd, Docker, and Kubernetes to v2 you gain deterministic CPU, memory, and I/O limits while simplifying observability.
Modern Linux distributions ship with cgroups v2 enabled by default, yet many production teams still run workloads on the legacy v1 hierarchy. This mismatch creates hidden fragmentation: some pods get precise throttling, others fall back to coarse limits. In this post we walk through the architecture of cgroups v2, outline a step‑by‑step migration path, and provide concrete configuration snippets for systemd, Docker, and Kubernetes. You’ll leave with a checklist you can run in a staging cluster today.
Why cgroups v2 Matters
- Unified hierarchy – All controllers live under a single tree, eliminating the “split‑brain” problem of v1, where cpu and memory were mounted as separate hierarchies.
- Improved accounting – Per‑cgroup I/O statistics via io.stat and a single cpu.max file for throttling.
- Simpler delegation – A subtree can be handed to a less‑privileged manager, and one cpu.max file replaces the v1 pair cpu.cfs_quota_us and cpu.cfs_period_us (cpu.shares becomes cpu.weight).
- Future‑proof – Newer kernel features such as per‑cgroup Pressure Stall Information (PSI) files and memory.high are exposed only in v2, making it the path for upcoming kernel features.
Real‑world impact: a 2024 benchmark from the CNCF showed that switching a 200‑node GKE cluster from v1 to v2 reduced tail‑latency for CPU‑bound microservices by 12 % and cut memory‑overcommit alerts by 30 % (Google Cloud blog).
Understanding cgroups v2 Fundamentals
The Single‑Tree Model
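In v2 the root of the hierarchy is mounted at /sys/fs/cgroup. Each enabled controller exposes interface files in every non‑root cgroup directory (the root itself carries no limit files):
/sys/fs/cgroup/
├─ cgroup.controllers
├─ cgroup.subtree_control
└─ <sub‑cgroup>/
   ├─ cpu.max
   ├─ memory.max
   ├─ io.max
   └─ pids.max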
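- cpu.max – <quota> <period> pair in microseconds, e.g. 200000 100000 (200 ms of CPU time per 100 ms period, i.e. two full CPUs); max as the quota means unlimited.
- memory.max – hard limit in bytes, max for unlimited.
- io.max – per‑device throttling keyed by <major>:<minor>, e.g. 8:0 rbps=52428800 wbps=31457280.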
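To make the cpu.max format concrete, here is a minimal sketch that turns one line of that file into a CPU count. The function name is an illustration, not part of any library.

```python
from typing import Optional


def cpu_max_to_cores(line: str) -> Optional[float]:
    """Convert a cpu.max line ("<quota> <period>") to a CPU count.

    Returns None when the quota is the literal "max" (no limit).
    """
    quota, period = line.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


print(cpu_max_to_cores("200000 100000"))  # 2.0
print(cpu_max_to_cores("max 100000"))     # None
```

On a real host you would pass it the contents of a cgroup's cpu.max file instead of a literal string.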
Key Controllers
| Controller | Primary Use | Example File |
|---|---|---|
| cpu | Time‑share scheduling | cpu.max |
| memory | Hard/soft limits, OOM kill | memory.max, memory.swap.max |
| io | Block I/O throttling | io.max |
| pids | Process count cap | pids.max |
| PSI (not a controller) | Resource pressure metrics | memory.pressure, cpu.pressure |
Observability
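Reading an interface file is as simple as cat (shown for a non‑root cgroup; limits are reported in bytes and microseconds):
$ cat /sys/fs/cgroup/<sub‑cgroup>/memory.max
8589934592
$ cat /sys/fs/cgroup/<sub‑cgroup>/cpu.max
50000 100000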
Tools such as systemd-cgtop and systemd-cgls present the hierarchy in a friendlier form. For Prometheus, cAdvisor (embedded in the kubelet) exports per‑container metrics read from cgroup v2, and node_exporter's pressure collector exports node‑level PSI.
Migration Path from v1 to v2
1. Verify Kernel Support
$ uname -r
6.6.9-arch1-1
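$ grep cgroup2 /proc/filesystems
nodev	cgroup2
There is no separate CONFIG_CGROUP_V2 option; v2 is part of CONFIG_CGROUPS. If cgroup2 is not listed, the kernel is too old (v2 became official in 4.5, and the cpu controller arrived in 4.15), so upgrade the kernel.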
2. Enable the Unified Hierarchy
Add to the kernel command line (GRUB):
systemd.unified_cgroup_hierarchy=1
Then regenerate GRUB config and reboot:
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
After reboot, confirm:
$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
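The same check can be scripted. The sketch below classifies the contents of /proc/mounts as unified, hybrid, or legacy. The function name and the labels it returns are choices made for this example.

```python
def cgroup_mode(mounts_text: str) -> str:
    """Classify a /proc/mounts dump by which cgroup filesystems are mounted."""
    has_v1 = has_v2 = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        fs_type = fields[2]  # /proc/mounts columns: device, mount point, type, ...
        if fs_type == "cgroup2":
            has_v2 = True
        elif fs_type == "cgroup":
            has_v1 = True
    if has_v2 and has_v1:
        return "hybrid"
    if has_v2:
        return "unified"
    if has_v1:
        return "legacy"
    return "none"


sample = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0\n"
print(cgroup_mode(sample))  # unified
```

On a node, call it as cgroup_mode(open("/proc/mounts").read()); anything other than "unified" means v1 hierarchies are still mounted.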
3. Switch Systemd to v2 (default on most distros)
Systemd automatically adopts the unified hierarchy when the kernel flag is set. Verify that /sys/fs/cgroup is a cgroup2 filesystem:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
If you run a custom init system, you’ll need to mount cgroup2 manually:
sudo mount -t cgroup2 none /sys/fs/cgroup
4. Update Container Runtimes
Docker
Docker 20.10+ supports v2 via the systemd cgroup driver. Edit /etc/docker/daemon.json:
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-cgroupns-mode": "private"
}
Then restart Docker:
sudo systemctl restart docker
Verify:
$ docker info | grep -i cgroup
Cgroup Driver: systemd
Cgroup Version: 2
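If daemon.json already holds other settings, overwriting the file wholesale would drop them. The following is a minimal sketch of merging in the driver option while leaving other keys alone; the helper name is hypothetical.

```python
import json


def with_systemd_cgroup_driver(config_text: str) -> str:
    """Return daemon.json content with native.cgroupdriver=systemd set.

    Existing keys are preserved; any previous cgroupdriver exec-opt is replaced.
    """
    cfg = json.loads(config_text) if config_text.strip() else {}
    opts = [o for o in cfg.get("exec-opts", [])
            if not o.startswith("native.cgroupdriver=")]
    opts.append("native.cgroupdriver=systemd")
    cfg["exec-opts"] = opts
    return json.dumps(cfg, indent=2) + "\n"


print(with_systemd_cgroup_driver('{"log-driver": "json-file"}'))
```

Write the result back to /etc/docker/daemon.json and restart Docker as shown above.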
containerd
For containerd (used by Kubernetes), enable the SystemdCgroup option for the runc runtime in /etc/containerd/config.toml:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
Restart:
sudo systemctl restart containerd
Kubernetes
The kubelet's cgroup driver must match the one configured in the container runtime. Set it with the (deprecated) flag:
--cgroup-driver=systemd
or, preferably, in kubelet-config.yaml:
cgroupDriver: systemd
cgroupRoot: ""
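Check the node. The kubelet's effective configuration (including cgroupDriver) is available from its configz endpoint, and the cgroup version from the filesystem type on the node itself:
kubectl get --raw "/api/v1/nodes/$(hostname)/proxy/configz" | grep -o '"cgroupDriver":"[^"]*"'
stat -fc %T /sys/fs/cgroup/   # prints cgroup2fs on v2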
5. Validate Limits
Deploy a test pod with explicit limits:
apiVersion: v1
kind: Pod
metadata:
  name: stress-cpu
spec:
  containers:
  - name: stress
    image: alpine
    command: ["sh", "-c", "while true; do :; done"]
    resources:
      limits:
        cpu: "500m"
        memory: "256Mi"
After scheduling, read the cgroup file from inside the container (with containerd as the runtime there is no Docker daemon to exec through, so use kubectl):
kubectl exec stress-cpu -- cat /sys/fs/cgroup/cpu.max
50000 100000
If the values match, the migration succeeded.
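To know what "match" means before you look, you can compute the expected file contents from the pod spec. This is a sketch that assumes the default 100 ms CFS period and handles only the common quantity suffixes; both function names are illustrative.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def k8s_cpu_to_cpu_max(cpu: str, period_us: int = 100_000) -> str:
    """Expected cpu.max content for a CPU limit such as "500m" or "2"."""
    if cpu.endswith("m"):
        milli = int(cpu[:-1])
    else:
        milli = round(float(cpu) * 1000)
    quota_us = milli * period_us // 1000
    return f"{quota_us} {period_us}"


def k8s_mem_to_memory_max(mem: str) -> str:
    """Expected memory.max content for a limit such as "256Mi" (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if mem.endswith(suffix):
            return str(int(mem[: -len(suffix)]) * factor)
    return str(int(mem))  # plain byte count


print(k8s_cpu_to_cpu_max("500m"))      # 50000 100000
print(k8s_mem_to_memory_max("256Mi"))  # 268435456
```

For the test pod above, cpu.max should read 50000 100000 and memory.max should read 268435456.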
Architecture of Resource Controllers in Container Runtimes
Systemd‑Managed Pods
When systemd is the cgroup driver, each pod receives a transient systemd slice, e.g.:
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/
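- Advantages: systemd owns the hierarchy, and limits surface as standard unit properties such as CPUQuota, MemoryMax, and IOWeight.
- Failure modes: Delegate= applies to service and scope units, not slices. If a container's scope lacks Delegate=yes, workloads that manage their own sub‑cgroups (for example nested container runtimes) can hit “permission denied” errors during start‑up.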
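The slice path follows a predictable pattern, which helps when you need to find a pod's cgroup on a node. The sketch below assumes the systemd driver's usual naming, where dashes in the pod UID become underscores and Guaranteed pods sit directly under kubepods.slice; verify against a live node, since the layout is an implementation detail.

```python
def pod_slice_path(qos: str, pod_uid: str, root: str = "/sys/fs/cgroup") -> str:
    """Build the expected cgroup directory for a pod under the systemd driver."""
    uid = pod_uid.replace("-", "_")  # "-" is the hierarchy separator in slice names
    if qos == "guaranteed":
        parts = ["kubepods.slice", f"kubepods-pod{uid}.slice"]
    elif qos in ("burstable", "besteffort"):
        parts = [
            "kubepods.slice",
            f"kubepods-{qos}.slice",
            f"kubepods-{qos}-pod{uid}.slice",
        ]
    else:
        raise ValueError(f"unknown QoS class: {qos!r}")
    return "/".join([root, *parts])


print(pod_slice_path("burstable", "12345678-abcd"))
```

The pod UID comes from kubectl get pod <name> -o jsonpath='{.metadata.uid}'.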
Example of an equivalent slice unit (illustrative only; the kubelet sets these properties at runtime rather than writing unit files, and resource settings belong in the [Slice] section)
[Unit]
Description=Kubernetes pod abc123
[Slice]
CPUQuota=50%
MemoryMax=512M
IOWeight=500
Kubelet creates these slices on‑the‑fly; you can inspect them with systemctl status <slice> or systemd-cgls.
runc Integration
runc uses the cgroupfs or systemd driver to populate the cgroup tree. Under v2, the runtime writes directly to the unified files:
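// Illustrative sketch, not runc's actual source. dir is the container's cgroup
// directory under /sys/fs/cgroup; needs "fmt", "os", and "path/filepath".
func setCPUMax(dir string, quota int64, period uint64) error {
	val := "max" // a non-positive quota means unlimited
	if quota > 0 {
		val = fmt.Sprint(quota)
	}
	val = fmt.Sprintf("%s %d", val, period)
	return os.WriteFile(filepath.Join(dir, "cpu.max"), []byte(val), 0o644)
}
Because quota and period land in a single file, there is one place to inspect when a CPU limit looks wrong, which reduces the surface area for user error.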
Observability Pipeline
A typical production stack:
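Kubelet → containerd → runc → cgroup v2 → cAdvisor / node_exporter → Prometheus → Grafana
- cAdvisor (embedded in the kubelet) exports per‑container series such as container_cpu_usage_seconds_total, container_cpu_cfs_throttled_seconds_total, and container_spec_memory_limit_bytes.
- node_exporter exports node‑level PSI, e.g. node_pressure_memory_waiting_seconds_total, which alert rules can use to trigger OOM mitigation.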
Implementation Strategies
1. Adopt a “One‑Slice‑Per‑Namespace” Policy
- Goal: Isolate each tenant (team, customer, or microservice) in its own systemd slice.
- How: Kubernetes has no built‑in setting for this, so it takes custom tooling to map a label (e.g., team=payments) to a parent slice (e.g., a hypothetical kubepods-team-payments.slice).
- Benefit: Enables per‑team QoS without touching individual pod specs.
2. Leverage cpu.max for Burstable Workloads
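Set an explicit CPU limit alongside the request so the container gets a cpu.max quota rather than only a relative weight:
resources:
  limits:
    cpu: "800m"
  requests:
    cpu: "200m"
The kubelet maps the request to a proportional weight in cpu.weight (the v2 successor to cpu.shares) and the limit to a hard quota in cpu.max; with the default 100 ms period, 800m becomes 80000 100000. The container keeps its share under contention and can burst up to 0.8 CPU when the node has idle capacity.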
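On cgroup v2 the CPU request ends up in cpu.weight, not cpu.max. The sketch below shows the linear shares‑to‑weight mapping described in the Kubernetes cgroup v2 design, which maps the v1 range [2, 262144] onto [1, 10000]. Treat it as an approximation, since runtimes are free to use a different conversion.

```python
def milli_cpu_to_shares(milli_cpu: int) -> int:
    """v1-style cpu.shares for a CPU request in millicores (minimum 2)."""
    return max(2, milli_cpu * 1024 // 1000)


def shares_to_weight(shares: int) -> int:
    """Linear mapping from cpu.shares [2, 262144] to cpu.weight [1, 10000]."""
    return 1 + ((shares - 2) * 9999) // 262142


for request in (200, 1000):
    shares = milli_cpu_to_shares(request)
    print(f"{request}m -> shares {shares} -> weight {shares_to_weight(shares)}")
```

Under this mapping a 200m request becomes weight 8 and a full CPU becomes weight 39; only the ratio between sibling cgroups matters, not the absolute number.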
3. Fine‑Tune I/O with io.max
Kubernetes has no native I/O bandwidth limit, so identify hot storage devices (e.g., /dev/nvme0n1), start from an ordinary pod spec, and apply the policy on the node afterwards:
apiVersion: v1
kind: Pod
metadata:
  name: db-writer
spec:
  containers:
  - name: writer
    image: postgres
    resources:
      limits:
        memory: "2Gi"
    volumeMounts:
    - mountPath: /var/lib/postgresql/data
      name: pgdata
  volumes:
  - name: pgdata
    persistentVolumeClaim:
      claimName: pgdata-pvc
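After the pod starts, apply I/O limits to its slice on the node with systemctl set-property. Each property and its value must be quoted as one argument, and --runtime keeps the change non‑persistent; the kubelet does not manage these properties, so check that they survive pod restarts:
systemctl set-property --runtime kubepods-burstable-pod<uid>.slice \
  "IOReadBandwidthMax=/dev/nvme0n1 50M" \
  "IOWriteBandwidthMax=/dev/nvme0n1 30M"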
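Underneath, the limits land in the cgroup's io.max file, keyed by the device's major:minor numbers. Here is a minimal sketch of building such a line; the helper names are illustrative, and device_numbers needs a real block device to run against.

```python
import os
from typing import Optional, Tuple


def device_numbers(path: str) -> Tuple[int, int]:
    """Return (major, minor) for a block device node such as /dev/nvme0n1."""
    st = os.stat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def format_io_max(major: int, minor: int,
                  rbps: Optional[int] = None,
                  wbps: Optional[int] = None) -> str:
    """Build one io.max line; None means no limit ("max") for that direction."""
    def fmt(value: Optional[int]) -> str:
        return "max" if value is None else str(value)
    return f"{major}:{minor} rbps={fmt(rbps)} wbps={fmt(wbps)}"


print(format_io_max(259, 0, rbps=50_000_000, wbps=30_000_000))
```

Reading io.max back after running systemctl set-property is a quick way to confirm the values systemd actually wrote.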
4. Enforce Process Count (pids.max) for Isolation
A runaway fork bomb can exhaust kernel PID space. Kubernetes has no pids entry under resources.limits; the cap is set per pod in the kubelet configuration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 100
The kubelet writes this value to pids.max in each pod's cgroup.
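To see how close a cgroup is to its cap, compare pids.current with pids.max. A minimal sketch, with an illustrative function name:

```python
from typing import Optional


def pids_usage(current_text: str, max_text: str) -> Optional[float]:
    """Fraction of the PID cap in use, or None when pids.max is "max"."""
    limit = max_text.strip()
    if limit == "max":
        return None
    return int(current_text) / int(limit)


print(pids_usage("25\n", "100\n"))  # 0.25
print(pids_usage("25\n", "max\n"))  # None
```

Feed it the contents of the two files from the same cgroup directory; a value approaching 1.0 means new forks are about to fail.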
5. Use Pressure Stall Information (PSI) for Proactive Scaling
memory.pressure and cpu.pressure expose the share of wall‑clock time that tasks spent stalled waiting for a resource. With node_exporter's pressure collector enabled, a Prometheus rule could look like:
- alert: HighMemoryPressure
  expr: rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.8
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Memory pressure > 80% on {{ $labels.instance }}"
    description: "Consider scaling out or increasing memory limits."
A Horizontal Pod Autoscaler (HPA) does not react to alerts directly; to scale on pressure, expose the same signal through a custom metrics adapter so the HPA can add replicas before OOM kills occur.
Performance Tuning and Observability
Benchmarking CPU Throttling
Run a stress-ng job inside a pod with varying cpu.max values and collect cpu.stat:
# Inside the container
cat /sys/fs/cgroup/cpu.stat
Typical output:
usage_usec 12345678
user_usec 11223344
system_usec 1122334
nr_periods 2000
nr_throttled 5
throttled_usec 25000
Interpretation:
- nr_throttled > 0 indicates the quota is being hit.
- throttled_usec shows total time spent waiting.
Adjust cpu.max until nr_throttled stays below a threshold (e.g., 1 % of periods).
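The threshold check is easy to automate. This sketch parses the flat "key value" format of cpu.stat and computes the fraction of periods that were throttled; the function names are illustrative.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse a flat-keyed cgroup file ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttle_ratio(stats: Dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


sample = "usage_usec 12345678\nnr_periods 2000\nnr_throttled 5\nthrottled_usec 25000\n"
print(throttle_ratio(parse_flat_keyed(sample)))  # 0.0025
```

For the sample output above the ratio is 0.25 %, comfortably under a 1 % threshold.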
Memory Pressure Monitoring
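memory.pressure contains two lines, some and full, each with avg10, avg60, and avg300 (percentage of time stalled over the last 10, 60, and 300 seconds) and total (cumulative stall time in microseconds). Example: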
cat /sys/fs/cgroup/memory.pressure
some avg10=0.02 avg60=0.01 avg300=0.00 total=12345
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
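A rising some value signals that at least one task is stalling on memory before anything is OOM‑killed; full means all non‑idle tasks were stalled at once. Pair this with the counters in the cgroup's memory.events file (high, max, oom, oom_kill) to spot trends.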
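For ad‑hoc checks or a custom exporter, the pressure file is simple to parse. A minimal sketch, with an illustrative function name:

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse a PSI file into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {key: float(value)
                        for key, value in (pair.split("=") for pair in pairs)}
    return result


sample = (
    "some avg10=0.02 avg60=0.01 avg300=0.00 total=12345\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])  # 0.02
```

The same parser works for cpu.pressure and io.pressure, which share the format.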
Visualization in Grafana
Create a dashboard panel with the PromQL:
rate(container_cpu_cfs_throttled_seconds_total[1m])
Overlay rate(node_pressure_memory_waiting_seconds_total[1m]) to correlate CPU throttling spikes with memory pressure events.
Key Takeaways
- Unified hierarchy eliminates cross‑controller inconsistencies and simplifies limit propagation.
- Systemd slices provide a clean, declarative way to delegate resources per pod or team.
- Explicit cpu.max and memory.max give deterministic throttling, avoiding the fuzzy “shares” model.
- I/O and PID controllers are first‑class citizens in v2; use them to guard against noisy neighbors.
- Pressure Stall Information is a powerful early‑warning signal for scaling decisions.
- Migration checklist: kernel support → unified mount → systemd → Docker/containerd → Kubernetes → validation.