TL;DR — Cgroups v2 replaces the fragmented v1 hierarchy with a single, coherent tree, adds fine‑grained I/O and memory controls, and integrates tightly with systemd and Kubernetes. The result is more predictable performance, easier administration, and a stronger foundation for modern cloud‑native workloads.

Linux has long relied on control groups (cgroups) to partition CPU, memory, I/O, and other resources among processes. The original implementation, cgroups v1, proved invaluable but suffered from a fragmented design that made advanced isolation painful. With the introduction of cgroups v2 in kernel 4.5, the Linux community received a clean‑slate redesign that addresses those pain points and aligns the kernel with the realities of containers, micro‑VMs, and serverless functions. This article walks through the historical context, the technical breakthroughs of v2, and the concrete impact on today’s cloud‑native ecosystems.

The Evolution from Cgroups v1 to V2

Limitations of v1

Cgroups v1 was built as a collection of independent hierarchies—one per subsystem (cpu, memory, blkio, etc.). While this modularity allowed early adopters to enable only the controllers they needed, it also introduced several operational headaches:

  1. Inconsistent Tree Structures – Each controller maintained its own directory tree under /sys/fs/cgroup. Aligning a process across CPU, memory, and I/O required creating matching sub‑directories in each hierarchy, a source of subtle bugs.
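  2. Controller Interference – Because each hierarchy tracked membership independently, a process could belong to a tightly limited group in the memory hierarchy and to the root group in the blkio hierarchy. Controllers therefore could not cooperate; buffered writeback, for example, could not be attributed reliably to the cgroup that dirtied the pages.
  3. Limited Feature Set – The blkio controller's throttling knobs, such as blkio.throttle.read_bps_device, were effective mainly for direct and synchronous I/O. Buffered writes, which dominate many workloads, largely escaped per-cgroup limits.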
  4. Complex Delegation – Granting unprivileged users ownership of a subtree required juggling multiple cgroup.procs files across hierarchies, often leading to permission errors.

These issues became especially visible as containers proliferated. Orchestrators such as Docker and Kubernetes needed a predictable, single source of truth for resource limits, but had to implement work‑arounds to keep v1’s fragmented state in sync.
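
The difference between the two layouts is visible in /proc/<pid>/cgroup: a v1 system lists one line per hierarchy, while a pure v2 system shows a single line of the form 0::/path. The following is a minimal sketch that classifies such a listing; the function name cgroup_layout is hypothetical and chosen only for illustration.

```shell
# Classify a /proc/<pid>/cgroup listing as v1, v2, or hybrid.
# v2 entries look like "0::/path"; v1 entries look like "5:memory:/path".
cgroup_layout() {
    file="${1:-/proc/self/cgroup}"
    awk -F: '
        $1 == "0" && $2 == "" { v2 = 1; next }   # unified-hierarchy entry
        NF >= 3               { v1 = 1 }          # per-controller v1 entry
        END {
            if (v1 && v2)  print "hybrid"
            else if (v2)   print "v2"
            else if (v1)   print "v1"
            else           print "unknown"
        }' "$file"
}
```

Called without arguments, it inspects the current shell; passing a saved listing makes it usable in scripts that audit many hosts.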

Design Goals of v2

The kernel developers set out to solve the above problems with a clear set of goals:

  • Unified Hierarchy – All controllers share a single tree, eliminating duplication and ensuring that a process’s resource limits are co‑located.
  • Rich, Consistent Interfaces – Controllers expose a file-based API with consistent names of the form <controller>.<knob> (e.g., memory.max, cpu.max, io.max).
  • Better Delegation & Security – The cgroup.subtree_control file lets a parent explicitly enable or disable child controllers, making delegation safe and auditable.
  • Performance‑Oriented Defaults – Controllers are tuned for low‑overhead accounting, and the kernel can batch updates to avoid excessive context switches.

The result is a more ergonomic, extensible platform that fits naturally with systemd’s unit model and Kubernetes’ pod abstraction.

Core Features of Cgroups V2

Unified Hierarchy

In v2 there is exactly one hierarchy, typically mounted at /sys/fs/cgroup. The controllers available on the system are listed in the root's cgroup.controllers file, and every cgroup exposes its settings under a common directory structure:

# Example: creating a new cgroup for a web service
sudo mkdir -p /sys/fs/cgroup/websvc
# Move the current shell into it; tee runs under sudo, unlike a plain redirect
echo $$ | sudo tee /sys/fs/cgroup/websvc/cgroup.procs

The cgroup.procs file now holds the PIDs for all controllers, guaranteeing that CPU, memory, and I/O limits are applied together. This eliminates the “duplicate subtree” problem that plagued v1.

Enhanced Memory Controller

V2’s memory controller consolidates several v1 files (memory.limit_in_bytes, memory.soft_limit_in_bytes, memory.kmem.limit_in_bytes) into a single, expressive interface:

# Set a hard memory limit of 2 GiB
echo 2147483648 > memory.max

# Set a soft limit (optional) that triggers reclamation before the hard limit
echo 1G > memory.high

The memory.swap.max file caps the amount of swap a cgroup may use, while memory.pressure exposes per-cgroup pressure stall information (PSI) that orchestration tools can poll to trigger proactive scaling. (memory.pressure_level belongs to the v1 interface and is not available in v2.)
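
On v2, per-cgroup memory pressure is reported in the memory.pressure file using the PSI format, with one "some" line and one "full" line that each carry avg10, avg60, avg300 and total fields. As a rough sketch of how a polling tool might read it, the hypothetical helper below extracts the 10-second average; it assumes PSI is enabled in the running kernel.

```shell
# Print the avg10 value from a PSI file such as memory.pressure.
# $1: path to the pressure file, $2: "some" (default) or "full"
psi_avg10() {
    awk -v kind="${2:-some}" '
        $1 == kind {
            for (i = 2; i <= NF; i++)
                if (sub(/^avg10=/, "", $i)) { print $i; exit }
        }' "$1"
}

# Typical use on a live system (path is an example):
#   psi_avg10 /sys/fs/cgroup/websvc/memory.pressure full
```

A scaler could compare the returned percentage against a threshold of its own choosing; the threshold itself is a policy decision and not something the kernel prescribes.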

I/O Controller Improvements

The legacy blkio controller is replaced by the io controller, which supports per-device bandwidth and IOPS limits for each cgroup and, together with the memory controller, can account for buffered writeback:

# Grant 50 MiB/s read bandwidth on /dev/sda for the entire cgroup
echo "8:0 rbps=52428800" > io.max

# Limit IOPS to 500 reads per second on the same device
echo "8:0 riops=500" >> io.max

The io.stat file reports cumulative per-device counters as lowercase key=value pairs:

8:0 rbytes=12345678 wbytes=87654321 rios=3456 wios=4321

These granular controls enable workloads such as databases to enforce QoS guarantees without kernel patches.
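
To act on these counters, a monitoring script needs to pull a single field for a single device. The sketch below assumes the lowercase key names that the kernel documents for io.stat (rbytes, wbytes, rios, wios); the helper name io_stat_field is hypothetical.

```shell
# Print one counter for one device from an io.stat file.
# $1: path to io.stat, $2: device as MAJ:MIN, $3: key such as rbytes
io_stat_field() {
    awk -v dev="$2" -v key="$3" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; exit }
            }
        }' "$1"
}

# Typical use on a live system (path is an example):
#   io_stat_field /sys/fs/cgroup/websvc/io.stat 8:0 wbytes
```

Because the counters are cumulative, a rate is obtained by sampling twice and dividing the difference by the interval.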

CPU and Scheduling

The cpu controller now uses the cpu.max file to express both quota and period in a single line:

# Limit the cgroup to 20 % of a single CPU (20 ms of runtime per 100 ms period; values are in microseconds)
echo "20000 100000" > cpu.max

When the quota field of cpu.max is max (the default is "max 100000"), the cgroup is not throttled, mirroring the "unlimited" semantics of v1 but with clearer syntax. The cpu.weight file (range 1-10000, default 100) replaces the old cpu.shares model, providing a proportional weighting system that is easier to reason about.
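
The quota is simply the desired share of one CPU multiplied by the period. A small sketch of that arithmetic follows; cpu_max_line is a hypothetical helper, and the 100000 microsecond default mirrors the period used in the example above.

```shell
# Build a cpu.max line from a percentage of one CPU.
# $1: percent of a single CPU (values above 100 span several CPUs)
# $2: period in microseconds (defaults to 100000)
cpu_max_line() {
    percent="$1"
    period="${2:-100000}"
    echo "$(( percent * period / 100 )) $period"
}

# Typical use (requires write access to the cgroup):
#   cpu_max_line 20 > /sys/fs/cgroup/websvc/cpu.max
```

For 20 percent the helper prints "20000 100000", and for 150 percent it prints "150000 100000", which allows one and a half CPUs' worth of runtime per period.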

Security and Delegation

Delegation is now explicit. A parent cgroup can enable a child controller via cgroup.subtree_control:

# Enable the memory and io controllers for all descendants
echo "+memory +io" > cgroup.subtree_control

Only after a controller is enabled in the parent's cgroup.subtree_control do that controller's interface files appear in the child cgroups. This prevents accidental privilege escalation and aligns with the principle of least privilege.
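
Before writing a limit into a child cgroup, a script can check whether the parent has actually enabled the controller. The following sketch reads the parent's cgroup.subtree_control, which holds a space-separated list of enabled controllers; the function name controller_delegated is hypothetical.

```shell
# Succeed if a controller is enabled for the children of a cgroup.
# $1: parent cgroup directory, $2: controller name (cpu, memory, io, ...)
controller_delegated() {
    for c in $(cat "$1/cgroup.subtree_control" 2>/dev/null); do
        if [ "$c" = "$2" ]; then
            return 0
        fi
    done
    return 1
}

# Typical use on a live system:
#   controller_delegated /sys/fs/cgroup io || echo "io not enabled for children"
```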

Real-World Impact on Containers and Cloud Workloads

Predictable Limits for Pods

Kubernetes maps each pod to a dedicated cgroup v2 subtree. Because all resources are co-located, the kubelet and the container runtime translate CPU requests into cpu.weight, CPU limits into cpu.max, and memory limits into memory.max. The pod spec has no equivalent field for block I/O, so io.max must be configured outside Kubernetes. The unified tree removes the need to keep several v1 hierarchies in sync for every pod.
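
As an illustration of how a CPU request can end up as a v2 weight, the sketch below first converts millicores to v1-style shares (1000m corresponds to 1024 shares) and then applies the linear shares-to-weight mapping that runc has historically used. Treat the formula as an assumption about one runtime's behavior: newer runtime versions may use a different curve, and both function names are hypothetical.

```shell
# Convert a CPU request in millicores to v1-style shares (1000m -> 1024).
millicpu_to_shares() {
    echo $(( $1 * 1024 / 1000 ))
}

# Map shares (2..262144) onto the cpu.weight range (1..10000) linearly.
# Assumption: this mirrors the conversion historically used by runc.
shares_to_weight() {
    echo $(( 1 + (($1 - 2) * 9999) / 262142 ))
}

# Typical use: weight for a 500m request
#   shares_to_weight "$(millicpu_to_shares 500)"
```

Under this mapping a request of one full CPU (1024 shares) yields a weight of 39, which is why weights observed on Kubernetes nodes are often far below the kernel default of 100.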

Faster Startup Times

V2's unified hierarchy reduces the work required to set up a container: a process joins its cgroup with a single cgroup.procs write instead of one write per v1 hierarchy. Benchmarks attributed to the CNCF report a 15 % reduction in container start latency on a 4-core machine when using v2 compared to v1, although the gain will vary with the runtime and the workload.

Better Multi‑Tenant Isolation

With per-cgroup I/O throttling, a noisy neighbor is far less able to saturate a block device and starve other tenants. Sandboxed runtimes such as gVisor, which Google Cloud uses for some of its serverless products, still rely on host cgroups to bound each sandbox, so the same v2 controls apply beneath the sandbox layer.

Integration with Systemd and Kubernetes

Systemd’s Native Support

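Systemd supported the unified hierarchy for several releases before making it the upstream build default in version 243; individual distributions switched on their own schedules. The Delegate= directive in unit files hands the unit's subtree to the service and enables the requested controllers for it, allowing services to manage their own children without compromising the host's security model: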

[Service]
Delegate=yes
CPUQuota=20%
MemoryMax=2G
IOReadBandwidthMax=/dev/sda 50M

Systemd also populates system.slice and user.slice hierarchies with the v2 layout, simplifying debugging: systemd-cgls shows a single tree rather than a forest of v1 mounts.
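
Limits like the ones above do not have to live in the main unit file; systemd also reads drop-in files from a <unit>.d directory. The sketch below generates such a drop-in. The function name and the 50-limits.conf file name are hypothetical, and the base directory is a parameter so the helper can be tried against a scratch directory first.

```shell
# Write a resource-limit drop-in for a systemd unit.
# $1: base directory (e.g. /etc/systemd/system)
# $2: unit name (e.g. websvc.service)
# $3: CPUQuota value (e.g. 20%)
# $4: MemoryMax value (e.g. 2G)
write_limits_dropin() {
    dir="$1/$2.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nCPUQuota=%s\nMemoryMax=%s\n' "$3" "$4" > "$dir/50-limits.conf"
}

# Typical use (as root), followed by a reload and a restart of the unit:
#   write_limits_dropin /etc/systemd/system websvc.service 20% 2G
#   systemctl daemon-reload && systemctl restart websvc.service
```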

Kubernetes CRI Integration

CRI runtimes such as containerd and CRI-O, together with low-level runtimes such as runc and crun, support cgroup v2, and Kubernetes declared its cgroup v2 support stable in v1.25. Each pod gets its own subtree in the unified hierarchy. A RuntimeClass selects the handler that runs a pod and can declare fixed CPU and memory overhead; it has no field for io.max, so I/O limits still have to be applied outside the RuntimeClass API.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: high-iops
handler: runc
overhead:
  podFixed:
    cpu: "200m"
    memory: "256Mi"
This declarative approach reduces operational friction and aligns with the “policy as code” paradigm.

Migration Strategies and Compatibility

Detecting v2 Availability

Before attempting migration, check the kernel version and mount options:

$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
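
Another common check is the filesystem type reported for /sys/fs/cgroup: stat prints cgroup2fs on a pure v2 host and tmpfs when v1 or hybrid mounts are in use. A minimal sketch that turns this into a script-friendly answer follows; the function name classify_cgroup_fs is hypothetical.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup layout.
# $1: output of `stat -fc %T /sys/fs/cgroup`
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Typical use on a live system:
#   classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup)"
```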

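If only v1 is present, you can switch to v2 at boot. On systemd-based distributions, add systemd.unified_cgroup_hierarchy=1 to the kernel command line; adding cgroup_no_v1=all additionally keeps every controller off the v1 hierarchies so that v2 can claim them:

# Add to GRUB_CMDLINE_LINUX in /etc/default/grub
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"

# Regenerate the GRUB config and reboot (update-grub is the Debian/Ubuntu wrapper)
sudo update-grub && sudo reboot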

Incremental Transition

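Some distributions ship a hybrid mode where both v1 and v2 are mounted. A controller can be attached to only one hierarchy at a time, so in hybrid mode systemd typically keeps the resource controllers on v1 and uses the v2 tree only for process tracking. Enforcement therefore moves to v2 when the host boots into the unified hierarchy, not one service at a time. Before the switch, you can turn on accounting defaults so that usage is visible for every unit:

# In /etc/systemd/system.conf
DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
DefaultIOAccounting=yes

Prepare services ahead of the switch by updating their unit files to use the v2 directives discussed earlier (for example, MemoryMax= instead of the v1-era MemoryLimit=). Verify resource enforcement with systemd-cgtop and systemd-cgls, or by reading the control files under /sys/fs/cgroup directly.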

Compatibility Pitfalls

  • Legacy Applications – Some older daemons read v1-specific files (e.g., memory.limit_in_bytes). Provide a compatibility shim or update the daemon.
  • Nested Containers – Running Docker inside a Docker container (Docker‑in‑Docker) may require the inner daemon to be started with --cgroup-parent pointing to a v2 subtree.
  • SELinux Labels – When delegating cgroups, ensure SELinux policies allow writes to cgroup.subtree_control; otherwise, you’ll see AVC denials.

Testing in a staging environment with realistic workloads is essential before a production cut‑over.
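
For the legacy-application case, a compatibility shim can be as small as a function that prefers the v2 file and falls back to the v1 name. The sketch below is one way to do that; read_memory_limit is a hypothetical name, and the caller is assumed to pass the cgroup directory that is appropriate for the host's layout.

```shell
# Print the memory limit of a cgroup on either a v2 or a v1 host.
# $1: cgroup directory
read_memory_limit() {
    if [ -r "$1/memory.max" ]; then
        cat "$1/memory.max"             # v2: bytes, or the literal "max"
    elif [ -r "$1/memory.limit_in_bytes" ]; then
        cat "$1/memory.limit_in_bytes"  # v1: bytes
    else
        return 1                        # memory controller not visible here
    fi
}

# Typical use on a live system (path is an example):
#   read_memory_limit /sys/fs/cgroup/websvc
```

Callers still need to handle the two spellings of "unlimited": v2 reports the string max, whereas v1 reports a very large number.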

Key Takeaways

  • Cgroups v2 replaces fragmented hierarchies with a single, unified tree, dramatically simplifying resource management.
  • The new io, memory, and cpu controllers provide finer‑grained limits, enabling true QoS for containers and cloud workloads.
  • Delegation and security are built into the API via cgroup.subtree_control, reducing the risk of accidental privilege escalation.
  • Integration with systemd and Kubernetes is now first‑class, allowing declarative, per‑pod resource policies without extra tooling.
  • Migration can be performed incrementally; most modern distributions already enable v2 by default, but careful testing is still required for legacy services.

Further Reading