TL;DR — cgroups v2 offers a unified hierarchy, richer controllers, and easier accounting. By switching systemd, Docker, and Kubernetes to v2 you gain deterministic CPU, memory, and I/O limits while simplifying observability.

Modern Linux distributions ship with cgroups v2 enabled by default, yet many production teams still run workloads on the legacy v1 hierarchy. This mismatch creates hidden fragmentation: some pods get precise throttling, others fall back to coarse limits. In this post we walk through the architecture of cgroups v2, outline a step‑by‑step migration path, and provide concrete configuration snippets for systemd, Docker, and Kubernetes. You’ll leave with a checklist you can run in a staging cluster today.

Why cgroups v2 Matters

  • Unified hierarchy – All controllers live under a single tree, eliminating the “split‑brain” problem of v1, where cpu and memory could be mounted as separate hierarchies with different process groupings.
  • Improved accounting – Accurate per‑cgroup I/O statistics via io.stat and unified cpu.max for throttling.
  • Simpler delegation – Child cgroups inherit limits automatically; no need to manually sync cpu.shares and cpu.cfs_quota_us.
  • Future‑proof – New kernel features (e.g., per‑cgroup Pressure Stall Information files and memory.high) are added only to v2, making it the only path for upcoming resource‑control features.

Real‑world impact: a 2024 benchmark from the CNCF showed that switching a 200‑node GKE cluster from v1 to v2 reduced tail‑latency for CPU‑bound microservices by 12 % and cut memory‑overcommit alerts by 30 % (Google Cloud blog).

Understanding cgroups v2 Fundamentals

The Single‑Tree Model

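In v2 the root of the hierarchy is mounted at /sys/fs/cgroup. Each enabled controller exposes its interface files inside every cgroup directory; limit files such as cpu.max exist only in non‑root cgroups:

/sys/fs/cgroup/
├─ cgroup.controllers
├─ cgroup.subtree_control
└─ <sub‑cgroup>/
   ├─ cpu.max
   ├─ memory.max
   ├─ io.max
   └─ pids.max
  • cpu.max – <quota> <period> pair in microseconds, e.g. 200000 100000 (200 ms of CPU time per 100 ms period, i.e. two full CPUs); max as the quota means unlimited.
  • memory.max – hard limit in bytes, max for unlimited.
  • io.max – per‑device throttling keyed by <major>:<minor>, e.g. 259:0 rbps=50000000 wbps=30000000.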
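The quota/period arithmetic behind cpu.max can be sketched in a few lines of Python. This is an illustrative helper, not part of any tool mentioned here; the function name cpu_max_to_cpus is hypothetical.

```python
from typing import Optional


def cpu_max_to_cpus(line: str) -> Optional[float]:
    """Return the CPU ceiling encoded in a cpu.max line, or None if unlimited."""
    quota, period = line.split()
    if quota == "max":
        # "max" as the quota means no throttling is applied.
        return None
    # Both values are in microseconds; their ratio is the number of CPUs.
    return int(quota) / int(period)


print(cpu_max_to_cpus("200000 100000"))  # 2.0
print(cpu_max_to_cpus("50000 100000"))   # 0.5
print(cpu_max_to_cpus("max 100000"))     # None
```

Reading the ratio rather than the raw quota makes it easier to compare cgroups that use different periods.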

Key Controllers

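Controller       Primary Use                    Example File
cpu              Time‑share scheduling          cpu.max
memory           Hard/soft limits, OOM kill     memory.max, memory.swap.max
io               Block I/O throttling           io.max
pids             Process count cap              pids.max
pressure (PSI)   Resource pressure metrics      memory.pressure, cpu.pressure

Note that pressure is a kernel feature (Pressure Stall Information) surfaced per cgroup rather than a controller you enable.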

Observability

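Reading a limit is as simple as running cat on the cgroup's interface file (values are reported in bytes and microseconds):

$ cat /sys/fs/cgroup/<sub‑cgroup>/memory.max
8589934592
$ cat /sys/fs/cgroup/<sub‑cgroup>/cpu.max
50000 100000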

Tools such as systemd-cgtop and systemd-cgls present the hierarchy in a friendlier form. For Prometheus, cAdvisor (embedded in the kubelet) exports per‑container metrics from the v2 files, and node_exporter's pressure collector exports node‑level PSI counters.

Migration Path from v1 to v2

1. Verify Kernel Support

$ uname -r
6.6.9-arch1-1
$ grep cgroup2 /proc/filesystems
nodev	cgroup2

cgroup v2 has been part of the core cgroup code (CONFIG_CGROUPS) since Linux 4.5; there is no separate CONFIG_CGROUP_V2 option. If cgroup2 is not listed, upgrade the kernel.

2. Enable the Unified Hierarchy

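On distributions that still boot into v1 or hybrid mode, add the following to the kernel command line (GRUB_CMDLINE_LINUX in /etc/default/grub):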

systemd.unified_cgroup_hierarchy=1

Then regenerate the GRUB config and reboot (the equivalent command is update-grub on Debian/Ubuntu and grub2-mkconfig on Fedora/RHEL):

sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot

After reboot, confirm:

$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
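
The same check can be scripted for fleet-wide audits. The sketch below relies on the fact that a pure v2 mount exposes cgroup.controllers at its root, while systemd's hybrid layout places the v2 tree under a unified/ subdirectory. The function names are hypothetical.

```python
import os
from typing import List


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Classify the cgroup layout mounted at root."""
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return "unified"      # pure v2: the root exposes cgroup.controllers
    if os.path.isdir(os.path.join(root, "unified")):
        return "hybrid"       # v1 controllers plus a v2 tree at <root>/unified
    if os.path.isdir(root):
        return "legacy"       # v1 only
    return "unavailable"      # nothing mounted (e.g. a non-Linux host)


def enabled_controllers(root: str = "/sys/fs/cgroup") -> List[str]:
    """List the controllers available at the root of a v2 hierarchy."""
    with open(os.path.join(root, "cgroup.controllers")) as f:
        return f.read().split()


if __name__ == "__main__":
    mode = cgroup_mode()
    print(mode)
    if mode == "unified":
        print(enabled_controllers())
```

Only nodes reporting unified are ready for the runtime changes in the next steps.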

3. Switch Systemd to v2 (default on most distros)

Systemd automatically adopts the unified hierarchy when the kernel flag is set. Verify the filesystem type that systemd mounted:

$ stat -fc %T /sys/fs/cgroup
cgroup2fs

If you run a custom init system, you’ll need to mount cgroup2 manually:

sudo mount -t cgroup2 none /sys/fs/cgroup

4. Update Container Runtimes

Docker

Docker 20.10+ supports v2 and selects the systemd cgroup driver by default on v2 hosts. To set it explicitly, edit /etc/docker/daemon.json:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-cgroupns-mode": "private"
}

Then restart Docker:

sudo systemctl restart docker

Verify:

$ docker info | grep -i cgroup
Cgroup Driver: systemd
Cgroup Version: 2
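
On hosts that already have a daemon.json, overwriting the file would drop unrelated settings. A minimal sketch of a merge, assuming the file is plain JSON; merge_daemon_config is a hypothetical helper and the log-driver key is only example input:

```python
import json


def merge_daemon_config(existing_text: str) -> str:
    """Return daemon.json text with the cgroup v2 settings applied, keeping other keys."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    # Drop any previous native.cgroupdriver entry but keep unrelated exec-opts.
    opts = [o for o in config.get("exec-opts", [])
            if not o.startswith("native.cgroupdriver=")]
    opts.append("native.cgroupdriver=systemd")
    config["exec-opts"] = opts
    config["default-cgroupns-mode"] = "private"
    return json.dumps(config, indent=2) + "\n"


print(merge_daemon_config(
    '{"log-driver": "journald", "exec-opts": ["native.cgroupdriver=cgroupfs"]}'
))
```

Write the result back to /etc/docker/daemon.json and restart Docker as shown above.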

containerd

For containerd (used by Kubernetes), enable SystemdCgroup for the runc runtime in /etc/containerd/config.toml (the table name below follows the containerd 1.x config layout):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

Restart:

sudo systemctl restart containerd

Kubernetes

The kubelet's cgroup driver must match the one used by the container runtime. Set the (deprecated) command-line flag:

--cgroup-driver=systemd

or in kubelet-config.yaml:

cgroupDriver: systemd
cgroupRoot: ""

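Check the driver the kubelet is running with, then the hierarchy on the node itself:

$ kubectl get --raw "/api/v1/nodes/$(hostname)/proxy/configz" | grep -o '"cgroupDriver":"[^"]*"'
"cgroupDriver":"systemd"
$ stat -fc %T /sys/fs/cgroup
cgroup2fs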

5. Validate Limits

Deploy a test pod with explicit limits:

apiVersion: v1
kind: Pod
metadata:
  name: stress-cpu
spec:
  containers:
  - name: stress
    image: alpine
    command: ["sh", "-c", "while true; do :; done"]
    resources:
      limits:
        cpu: "500m"
        memory: "256Mi"

After scheduling, inspect the cgroup files from inside the container:

$ kubectl exec stress-cpu -- cat /sys/fs/cgroup/cpu.max /sys/fs/cgroup/memory.max
50000 100000
268435456

If the values match, the migration succeeded.
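
To know which values to expect for an arbitrary pod spec, the translation from Kubernetes quantities to cgroup values can be sketched as follows. This assumes the kubelet's default 100 ms CFS period and covers only common quantity suffixes; the function names are hypothetical.

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_limit_to_cpu_max(cpu: str, period_us: int = 100_000) -> str:
    """Translate a Kubernetes CPU limit ("500m", "2", "0.5") into a cpu.max line."""
    if cpu.endswith("m"):
        millicores = int(cpu[:-1])
    else:
        millicores = int(round(float(cpu) * 1000))
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def memory_limit_to_bytes(mem: str) -> int:
    """Translate a Kubernetes memory quantity ("256Mi", "1G") into bytes."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if mem.endswith(suffix):
            return int(float(mem[: -len(suffix)]) * factor)
    return int(mem)  # plain byte count


print(cpu_limit_to_cpu_max("500m"))    # 50000 100000
print(memory_limit_to_bytes("256Mi"))  # 268435456
```

Comparing these computed values with the files read from the container turns the manual check into an automated one.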

Architecture of Resource Controllers in Container Runtimes

Systemd‑Managed Pods

When systemd is the cgroup driver, each pod receives a transient systemd slice, e.g.:

/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/
  • Advantages: automatic delegation, per‑slice CPUQuota, MemoryMax, and IOWeight.
  • Failure modes: If a container's scope is not created with Delegate=yes (slice units themselves do not accept Delegate=), the container cannot create sub‑cgroups, leading to “permission denied” errors during start‑up.
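
When debugging, it helps to compute the directory you expect for a given pod. The sketch below assumes the systemd driver's naming scheme, in which dashes in the pod UID are replaced by underscores (a dash is the hierarchy separator in slice names) and Guaranteed pods sit directly under kubepods.slice. The function name and the example UID are illustrative.

```python
def pod_slice_path(pod_uid: str, qos: str, root: str = "/sys/fs/cgroup") -> str:
    """Build the expected cgroup directory for a pod under the systemd driver."""
    uid = pod_uid.replace("-", "_")  # "-" separates slice levels in systemd
    if qos == "guaranteed":
        # Guaranteed pods have no intermediate QoS slice.
        return f"{root}/kubepods.slice/kubepods-pod{uid}.slice"
    if qos not in ("burstable", "besteffort"):
        raise ValueError(f"unknown QoS class: {qos}")
    return (f"{root}/kubepods.slice/kubepods-{qos}.slice/"
            f"kubepods-{qos}-pod{uid}.slice")


print(pod_slice_path("0a1b2c3d-1111-2222-3333-444455556666", "burstable"))
```

The printed path can be passed straight to ls or cat to inspect the pod's limit files.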

Example systemd unit for a pod slice

# kubepods-burstable-podabc123.slice
[Unit]
Description=Kubernetes pod abc123

[Slice]
CPUQuota=50%
MemoryMax=512M
IOWeight=500

Kubelet creates these slices on‑the‑fly; you can inspect them with systemctl status.

runc Integration

runc uses the cgroupfs or systemd driver to populate the cgroup tree. Under v2, the runtime writes directly to the unified files:

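// Simplified sketch modeled on runc's libcontainer/cgroups/fs2/cpu.go
// (needs imports: fmt, os, path/filepath)
// dirPath is the container's cgroup directory under /sys/fs/cgroup.
func setCPU(dirPath string, quota int64, period uint64) error {
    max := fmt.Sprintf("%d %d", quota, period)
    return os.WriteFile(filepath.Join(dirPath, "cpu.max"), []byte(max), 0o644)
}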

Because the interface is uniform, the runtime writes a single cpu.max value instead of keeping separate cpu.cfs_quota_us and cpu.cfs_period_us files in sync, which reduces the surface area for error.

Observability Pipeline

A typical production stack:

Kubelet → containerd → runc → cgroup v2 → cAdvisor / node_exporter → Prometheus → Grafana
  • cAdvisor exports per‑container series such as container_cpu_usage_seconds_total, container_spec_memory_limit_bytes, and container_fs_writes_bytes_total.
  • Alert rules use node_exporter's node_pressure_memory_waiting_seconds_total to trigger OOM mitigation.

Implementation Strategies

1. Adopt a “One‑Slice‑Per‑Namespace” Policy

  • Goal: Isolate each tenant (team, customer, or microservice) in its own systemd slice.
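  • How: Use a mutating admission webhook to attach a slice annotation based on a label (e.g., team=payments → kubepods-team-payments.slice), plus node‑side tooling that places the pod accordingly. Kubernetes has no built‑in per‑pod slice setting, so this part is custom.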
  • Benefit: Enables per‑team QoS without touching individual pod specs.

2. Leverage cpu.max for Burstable Workloads

Instead of the legacy cpu.shares, set explicit cpu.max quotas:

resources:
  limits:
    cpu: "800m"
  requests:
    cpu: "200m"

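Kubelet maps the CPU limit to cpu.max with the default 100 ms period (800m becomes 80000 100000) and the CPU request to a proportional cpu.weight. The container keeps its weighted share under contention and can burst up to 0.8 CPU when the node has idle capacity, which gives predictable burst behavior.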
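
On cgroup v2 the CPU request is expressed through cpu.weight rather than cpu.max. As an illustration, the sketch below applies the linear shares-to-weight mapping that the kubelet and runc have historically used; newer releases may use a different curve, so treat the exact numbers as an assumption and confirm by reading cpu.weight on your nodes.

```python
def millicpu_to_shares(millicpu: int) -> int:
    """CPU request in millicores -> v1-style cpu.shares (1024 per core, minimum 2)."""
    return max(2, millicpu * 1024 // 1000)


def shares_to_weight(shares: int) -> int:
    """Map cpu.shares [2, 262144] linearly onto cpu.weight [1, 10000]."""
    return 1 + ((shares - 2) * 9999) // 262142


request_millicpu = 200  # the 200m request from the spec above
shares = millicpu_to_shares(request_millicpu)
print(shares, shares_to_weight(shares))  # 204 8
```

A weight only matters relative to sibling cgroups, so a small value such as 8 does not throttle the container while the node is idle.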

3. Fine‑Tune I/O with io.max

Identify hot storage devices (e.g., /dev/nvme0n1) and create a per‑pod I/O policy:

apiVersion: v1
kind: Pod
metadata:
  name: db-writer
spec:
  containers:
  - name: writer
    image: postgres
    resources:
      limits:
        memory: "2Gi"
    volumeMounts:
    - mountPath: /var/lib/postgresql/data
      name: pgdata
  volumes:
  - name: pgdata
    persistentVolumeClaim:
      claimName: pgdata-pvc

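Kubernetes has no native field for block I/O limits, so after the pod starts, apply them to the pod's slice with a one‑shot systemctl set-property command (quote each property, since the value contains a space):

sudo systemctl set-property --runtime kubepods-burstable-pod<uid>.slice \
  "IOReadBandwidthMax=/dev/nvme0n1 50M" \
  "IOWriteBandwidthMax=/dev/nvme0n1 30M"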
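
To verify the result, compare the pod's io.max file with the line you expect. The sketch below builds that line, assuming systemd's base‑1000 interpretation of the K/M/G suffixes for bandwidth settings; the helper names are hypothetical, and 259:0 is only a typical major:minor pair for an NVMe device.

```python
import os
from typing import Tuple

_UNITS = {"K": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def parse_bandwidth(value: str) -> int:
    """Convert "50M" style values to bytes per second (base 1000)."""
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(value[:-1]) * _UNITS[suffix]
    return int(value)


def device_numbers(path: str) -> Tuple[int, int]:
    """Return (major, minor) for a block device node such as /dev/nvme0n1."""
    st = os.stat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def io_max_line(major: int, minor: int, rbps: str, wbps: str) -> str:
    """Format the entry expected in io.max for one device."""
    return f"{major}:{minor} rbps={parse_bandwidth(rbps)} wbps={parse_bandwidth(wbps)}"


print(io_max_line(259, 0, "50M", "30M"))  # 259:0 rbps=50000000 wbps=30000000
```

On a real node, pass the output of device_numbers("/dev/nvme0n1") instead of the hard-coded pair.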

4. Enforce Process Count (pids.max) for Isolation

A runaway fork bomb can exhaust kernel PID space. Kubernetes has no pids entry under resources.limits; the cap is set per pod in the kubelet configuration (kubelet-config.yaml):

podPidsLimit: 100

The kubelet writes this value to pids.max in each pod's cgroup.
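
A quick way to see how close a workload is to its cap is to compare pids.current with pids.max in the same cgroup directory. A minimal sketch; pids_headroom is a hypothetical helper and the directory argument is whatever cgroup you want to inspect:

```python
import os
from typing import Optional


def pids_headroom(cgroup_dir: str) -> Optional[int]:
    """Return how many more tasks the cgroup may create, or None if uncapped."""
    with open(os.path.join(cgroup_dir, "pids.max")) as f:
        limit = f.read().strip()
    if limit == "max":
        return None  # no limit configured
    with open(os.path.join(cgroup_dir, "pids.current")) as f:
        current = int(f.read().strip())
    return int(limit) - current
```

A headroom that trends toward zero is a signal to raise the cap or investigate a process leak before forks start failing.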

5. Use Pressure Stall Information (PSI) for Proactive Scaling

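memory.pressure and cpu.pressure report the share of wall‑clock time in which tasks were stalled waiting for the resource. node_exporter exposes the node‑level equivalent as counters, so a Prometheus rule can alert on the stall rate:

- alert: HighMemoryPressure
  expr: rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.8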
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Memory pressure > 80% on {{ $labels.instance }}"
    description: "Consider scaling out or increasing memory limits."

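When the alert fires, scale out before OOM kills occur. The Horizontal Pod Autoscaler (HPA) does not react to alerts directly, so feed it the same signal through a custom‑metrics adapter or trigger scaling from your alert automation.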

Performance Tuning and Observability

Benchmarking CPU Throttling

Run a stress-ng job inside a pod with varying cpu.max values and collect cpu.stat:

# Inside the container
cat /sys/fs/cgroup/cpu.stat

Typical output:

usage_usec 12345678
user_usec  11223344
system_usec 1122334
nr_periods 2000
nr_throttled 5
throttled_usec 25000

Interpretation:

  • nr_throttled > 0 indicates the quota is being hit.
  • throttled_usec shows the total time, in microseconds, that tasks in the cgroup were throttled.

Adjust cpu.max until nr_throttled stays below a threshold (e.g., 1 % of periods).
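
The threshold check can be automated by parsing cpu.stat. A minimal sketch using the sample output above; the function names are hypothetical:

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the "key value" lines of cpu.stat into a dict of ints."""
    return {
        key: int(value)
        for key, value in (line.split() for line in text.splitlines() if line.strip())
    }


def throttle_ratio(stat: dict) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stat.get("nr_periods", 0)
    return stat.get("nr_throttled", 0) / periods if periods else 0.0


sample = """usage_usec 12345678
user_usec  11223344
system_usec 1122334
nr_periods 2000
nr_throttled 5
throttled_usec 25000
"""
print(throttle_ratio(parse_cpu_stat(sample)))  # 0.0025
```

Here 5 of 2000 periods were throttled (0.25 %), which is under the 1 % example threshold.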

Memory Pressure Monitoring

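memory.pressure contains two lines, some and full, each with 10‑, 60‑, and 300‑second averages (avg10, avg60, avg300, in percent) and a cumulative total in microseconds. Example: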

cat /sys/fs/cgroup/memory.pressure
some avg10=0.02 avg60=0.01 avg300=0.00 total=12345
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

A rising some value signals that processes are waiting for memory but not yet OOM‑killed. Pair this with the counters in the cgroup's memory.events file (high, max, oom, oom_kill) to spot trends.
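
For ad‑hoc checks or a custom exporter, the pressure file format is easy to parse. A minimal sketch using the sample output above; parse_pressure is a hypothetical helper:

```python
def parse_pressure(text: str) -> dict:
    """Parse a PSI file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = dict(field.split("=") for field in fields)
        # Averages are percentages; total is cumulative stall time in microseconds.
        result[kind] = {
            key: (int(val) if key == "total" else float(val))
            for key, val in values.items()
        }
    return result


sample = """some avg10=0.02 avg60=0.01 avg300=0.00 total=12345
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""
psi = parse_pressure(sample)
print(psi["some"]["avg10"], psi["full"]["total"])  # 0.02 0
```

The same parser works for cpu.pressure and io.pressure, which share this layout.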

Visualization in Grafana

Create a dashboard panel with the PromQL:

rate(container_cpu_cfs_throttled_seconds_total[1m])

Overlay rate(node_pressure_memory_waiting_seconds_total[1m]) to correlate CPU throttling spikes with memory pressure events.

Key Takeaways

  • Unified hierarchy eliminates cross‑controller inconsistencies and simplifies limit propagation.
  • Systemd slices provide a clean, declarative way to delegate resources per pod or team.
  • Explicit cpu.max and memory.max give deterministic throttling, avoiding the fuzzy “shares” model.
  • I/O and PID controllers are first‑class citizens in v2; use them to guard against noisy neighbors and fork bombs.
  • Pressure Stall Information is a powerful early‑warning signal for scaling decisions.
  • Migration checklist: kernel support → unified mount → systemd → Docker/containerd → Kubernetes → validation.

Further Reading