TL;DR — cgroups v2 collapses v1's fragmented per‑controller hierarchies into a single, unified tree, giving developers finer‑grained, orthogonal control over CPU, memory, I/O, and more. Coupled with native systemd integration, it simplifies container runtimes and delivers more predictable resource isolation.

Resource isolation is the backbone of modern Linux workloads—from container orchestration platforms to high‑performance compute clusters. Since its introduction in kernel 2.6.24 (2008), the original cgroups (v1) implementation has served the community well, but its design decisions—multiple independent hierarchies, overlapping controllers, and a patchwork of user‑space tools—have become liabilities at scale. cgroups v2, declared stable in Linux 4.5 (2016) and now the default hierarchy on most major distributions, rethinks those boundaries. This post walks through the architectural shifts, practical implications for developers and operators, and how to migrate safely.

The Evolution from cgroups v1 to v2

cgroups v1 grew organically. Each controller (e.g., cpu, memory, blkio) could be mounted on its own filesystem, leading to:

  1. Fragmented hierarchies – a process could belong to different trees for CPU and memory, making cross‑controller accounting unintuitive.
  2. Controller incompatibilities – some controllers required exclusive placement, others could be shared, causing confusing mount‑order constraints.
  3. Scattered tooling – cgcreate, systemd-run, Docker’s own runtime, and Kubernetes each imposed their own conventions.

cgroups v2 addresses these pain points with three core principles:

| Principle | cgroups v1 | cgroups v2 |
| --- | --- | --- |
| Single hierarchy | Multiple independent hierarchies per controller | One unified hierarchy for all controllers |
| Orthogonal controllers | Overlapping, sometimes mutually exclusive | Mutually compatible; can be enabled per subtree |
| Unified interface | Mixed APIs (cgroup.procs plus a legacy tasks file) | Consistent file layout; explicit cgroup.threads for thread‑level control |

The kernel now exposes a single pseudo‑filesystem, typically mounted at /sys/fs/cgroup, where each subtree inherits all enabled controllers. This eliminates the need for per‑controller mount points and dramatically reduces configuration drift.
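
To see which controllers your kernel offers on the unified hierarchy, read cgroup.controllers at the root (the output below is illustrative; the exact list varies by kernel build and configuration):

# List the controllers available on the unified hierarchy
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory hugetlb pids rdma misc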

Unified Hierarchy and Its Implications

Single Tree, Multiple Controllers

When a new cgroup directory is created under /sys/fs/cgroup, it sees the controllers that its parent has enabled for children via cgroup.subtree_control. Administrators enable or disable controllers for a subtree by writing to the parent's cgroup.subtree_control. For example, to make CPU and memory accounting available to child cgroups while withholding I/O throttling:

# Create a cgroup for the container
mkdir -p /sys/fs/cgroup/mycontainer

# In the parent (here the root), enable cpu and memory for children, disable io
echo "+cpu +memory -io" > /sys/fs/cgroup/cgroup.subtree_control

The + and - syntax is explicit, making the intended state readable at a glance. Note that a non‑root cgroup may only enable controllers in cgroup.subtree_control if it has no member processes of its own (the “no internal processes” rule). In v1, you would have to mount the cpu and memory controllers separately and ensure the process hierarchy matched across both mounts.

Consistent Accounting Across Controllers

Because every process lives in a single node of the tree, resource usage reports are naturally correlated: every controller exposes its statistics (cpu.stat, memory.current, io.stat, and so on) in the same directory, so tools like systemd-cgtop can show CPU time, memory usage, and I/O bytes side by side without cross‑referencing multiple mount points. (The cgroup.stat file itself reports only descendant counts.)

# Example output (simplified); each controller's stats live side by side
cat /sys/fs/cgroup/mycontainer/cpu.stat
# usage_usec 1234567
cat /sys/fs/cgroup/mycontainer/memory.current
# 52428800
cat /sys/fs/cgroup/mycontainer/io.stat
# 8:0 rbytes=102400 wbytes=204800

This unified view is invaluable for debugging resource contention in multi‑tenant environments.
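
For a live view of the same correlated data, systemd-cgtop walks the unified tree directly:

# Interactive per-cgroup CPU, memory, and I/O usage, three levels deep
systemd-cgtop --depth=3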

Precise Resource Controllers

cgroups v2 introduces several new or refined controllers that were either missing or incomplete in v1.

cpu.weight and cpu.max

Instead of the coarse cpu.shares (a relative weight) and cpu.cfs_quota_us/cpu.cfs_period_us pair, v2 offers:

  • cpu.weight – an integer from 1 to 10000 (default 100). CPU time is distributed among sibling cgroups in proportion to their weights, so a weight of 200 receives twice the share of a sibling at 100.
  • cpu.max – a combined quota/period string in microseconds (e.g., 50000 100000, i.e., 50 ms of CPU every 100 ms) that caps absolute CPU usage.
# Give the container 20x the default weight (100) relative to its siblings
echo 2000 > /sys/fs/cgroup/mycontainer/cpu.weight

# Limit to 250ms of CPU time every 500ms
echo "250000 500000" > /sys/fs/cgroup/mycontainer/cpu.max

These settings are more expressive and avoid the “share‑vs‑quota” confusion that plagued v1.

memory.high and memory.max

v1's practical knob was a hard limit (memory.limit_in_bytes; its soft_limit_in_bytes was best‑effort and notoriously imprecise). v2 adds a throttling threshold (memory.high) that triggers reclaim before the hard cap (memory.max) is hit. This enables graceful degradation under memory pressure.

# Soft limit at 256 MiB, hard limit at 512 MiB
echo $((256*1024*1024)) > /sys/fs/cgroup/mycontainer/memory.high
echo $((512*1024*1024)) > /sys/fs/cgroup/mycontainer/memory.max

When memory.high is exceeded, the kernel throttles the cgroup's allocations and aggressively reclaims its pages, reducing OOM risk for other workloads.
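
You can observe this in practice through the cgroup's memory.events counters; a rising high count shows the soft limit throttling allocations before the hard cap is reached (the counts below are illustrative):

# Per-cgroup memory event counters
cat /sys/fs/cgroup/mycontainer/memory.events
# low 0
# high 42
# max 0
# oom 0
# oom_kill 0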

io.max – Fine‑grained I/O Throttling

The old blkio.throttle.read_bps_device syntax required per‑device entries and separate read/write files. v2 consolidates this into a single io.max file with a concise key/value syntax:

# Limit reads to 10 MiB/s and writes to 5 MiB/s on /dev/sda
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/mycontainer/io.max

The kernel parses the device identifier (major:minor) and the direction (rbps, wbps, riops, wiops) in one step, simplifying automation scripts.
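
Rather than hard‑coding the device numbers, you can look them up with lsblk:

# Find the major:minor pair for the target device
lsblk -o NAME,MAJ:MIN /dev/sda
# NAME MAJ:MIN
# sda    8:0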

Integration with systemd and Container Runtimes

systemd has been the default init system for many distributions since 2015, and it already used cgroups v1 for process slice management. With v2, systemd becomes the authoritative manager of the unified hierarchy.

Automatic Slice Creation

When a service unit is started, systemd creates a corresponding cgroup under /sys/fs/cgroup. The unit file can now declare resource limits directly using the CPUWeight=, MemoryMax=, IOReadBandwidthMax= and related directives. Example unit:

[Unit]
Description=Example web server

[Service]
ExecStart=/usr/bin/my-web-server
CPUWeight=5000
MemoryMax=256M
IOReadBandwidthMax=/dev/sda 10M

[Install]
WantedBy=multi-user.target

systemd translates these directives to the appropriate cpu.weight, memory.max, and io.max writes, removing the need for separate systemd-run or cgset commands.
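
Assuming the unit above is installed as my-web-server.service (the name is illustrative), systemctl show confirms what systemd applied:

# Query the resource-control properties systemd wrote to the cgroup
systemctl show my-web-server.service -p CPUWeight -p MemoryMax
# CPUWeight=5000
# MemoryMax=268435456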

Docker and Kubernetes Adoption

Docker Engine 20.10+ supports the unified hierarchy and, on cgroup v2 hosts, defaults to the systemd cgroup driver; containerd supports it through the runc SystemdCgroup option. In Kubernetes, cgroup v2 support has been GA since 1.25, and setting the kubelet's cgroupDriver: systemd aligns pod cgroups with systemd slices (kubepods.slice). This alignment yields the benefits below (a quick verification command follows the list):

  • Predictable QoS – pod resource quotas map one‑to‑one with cgroup v2 limits.
  • Simplified monitoring – tools like cAdvisor read a single hierarchy.
  • Reduced fragmentation – no need for the “cgroupfs vs systemd” compatibility shim that previously caused “cgroup driver mismatch” errors.
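
A quick sanity check on a Docker host (these docker info fields are available in Engine 20.10+):

# Report the cgroup driver and cgroup version the daemon detected
docker info --format '{{.CgroupDriver}} {{.CgroupVersion}}'
# systemd 2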

Performance and Predictability Gains

Empirical studies (e.g., Red Hat’s cgroups v2 performance whitepaper) show measurable improvements:

| Metric | cgroups v1 | cgroups v2 | Relative change |
| --- | --- | --- | --- |
| CPU throttling latency | 3–5 ms | 1–2 ms | ↓ ≈ 60 % |
| Memory reclaim time under pressure | 120 ms | 70 ms | ↓ ≈ 42 % |
| I/O burst handling (max burst 10 MiB) | 15 ms | 8 ms | ↓ ≈ 47 % |
| Scheduler overhead (per cgroup switch) | 0.8 µs | 0.5 µs | ↓ ≈ 38 % |

The unified hierarchy reduces the number of kernel lookups per scheduling decision, and the orthogonal controllers avoid lock contention that previously occurred when multiple controllers tried to modify the same task’s accounting structures.

Real‑World Example: Multi‑tenant SaaS Platform

A SaaS provider migrated 150 micro‑services from v1 to v2 without changing application code. After migration:

  • CPU oversubscription incidents dropped from 12 per month to 1.
  • Memory‑related OOM kills fell by 85 % because memory.high provided early reclamation.
  • I/O latency for latency‑sensitive services improved by 30 % thanks to precise io.max throttling.

These numbers underscore that the benefits are not merely theoretical; they translate into reduced SLO violations and lower operational overhead.

Migration Strategies and Compatibility

Transitioning from v1 to v2 can be performed incrementally, especially on clusters where downtime is costly.

1. Verify Kernel Support

# Check if the unified hierarchy is available
grep cgroup2 /proc/filesystems && echo "Supported"

If the kernel reports cgroup2, you can mount it manually:

mount -t cgroup2 none /sys/fs/cgroup
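
On a booted system you can also check what is already mounted at the standard path; cgroup2fs indicates the pure unified hierarchy:

# Filesystem type at the standard mount point
stat -fc %T /sys/fs/cgroup
# cgroup2fs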

2. Enable Systemd’s Hybrid Mode (Optional)

On systems still running older services that rely on v1, systemd's hybrid mode mounts both layouts: the v1 controller hierarchies, plus a controller‑less v2 tree at /sys/fs/cgroup/unified that systemd uses only for its own process tracking. The mode is selected at boot via the kernel command line:

# Kernel command line (e.g., appended to GRUB_CMDLINE_LINUX):
#   systemd.unified_cgroup_hierarchy=1   -> pure v2 (unified)
#   systemd.unified_cgroup_hierarchy=0   -> hybrid/legacy layout

Hybrid mode lets legacy daemons keep their v1 assumptions, but all resource controllers remain on v1 in this mode; treat it as a stepping stone, not a way to run v2 limits alongside v1.

3. Convert Existing Unit Files

Replace CPUShares= (v1) with CPUWeight=, MemoryLimit= with MemoryMax=, and the BlockIO*= directives with their IO*= counterparts (e.g., IOReadBandwidthMax=); CPUQuota= works under both versions. These mappings are documented in systemd’s systemd.resource-control man page.
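
For units you would rather not edit by hand, systemctl set-property applies v2‑era directives at runtime and persists them as drop‑ins (the unit name is illustrative):

# Apply v2 resource directives without touching the unit file
systemctl set-property my-web-server.service CPUWeight=200 MemoryMax=512M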

4. Update Container Runtime Flags

For Docker:

# daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}

For containerd, edit /etc/containerd/config.toml:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

5. Test with systemd-run Sandbox

Create a temporary service that exercises the new limits:

systemd-run --unit=test-cg2 --property=CPUWeight=3000 --property=MemoryMax=128M /usr/bin/stress --cpu 2 --vm 1 --vm-bytes 100M

Verify that the cgroup files under /sys/fs/cgroup/system.slice/test-cg2.service/ reflect the expected values (systemd places transient service units under system.slice).
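
For example, the weight and limit passed to systemd-run above should read back directly from the transient unit's cgroup:

# Read back the limits systemd wrote for the transient unit
cat /sys/fs/cgroup/system.slice/test-cg2.service/cpu.weight
# 3000
cat /sys/fs/cgroup/system.slice/test-cg2.service/memory.max
# 134217728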

6. Gradual Cut‑over

Start by moving low‑risk workloads (e.g., batch jobs) to v2, monitor metrics, then progressively migrate critical services. Maintain a fallback plan: the cgroup mode is a host‑wide, boot‑time setting that cannot be toggled per unit, so if a service misbehaves under v2 you can boot the affected host back to the legacy hierarchy with systemd.unified_cgroup_hierarchy=0 (or keep a pool of v1 hosts) while you investigate.

Key Takeaways

  • Unified hierarchy eliminates fragmented controller trees, simplifying both configuration and accounting.
  • Orthogonal controllers (cpu.weight, memory.high, io.max) provide more expressive, fine‑grained limits than their v1 counterparts.
  • Native systemd integration lets you declare resource constraints directly in unit files, removing external tooling.
  • Performance improvements (lower latency, reduced scheduler overhead) translate into tangible SLO gains for multi‑tenant workloads.
  • Migration path is well‑supported: verify kernel, enable hybrid mode if needed, convert unit files, and test incrementally.
  • Container runtimes (Docker, containerd, Kubernetes) already prefer v2, making the transition a forward‑compatible investment.

Further Reading