TL;DR: cgroups v2 replaces the fragmented v1 tree with a single, unified hierarchy and a set of modular controllers. By mastering its mount options, controller activation, and per‑cgroup configuration files, you can enforce predictable CPU, memory, I/O, and process‑count limits for containers, VMs, or any Linux workload.

Resource isolation has become a non‑negotiable requirement for modern cloud‑native workloads. While cgroups v1 served us well for a decade, its scattered hierarchy and controller‑specific quirks made large‑scale tuning a nightmare. This post unpacks the unified hierarchy, walks through each controller, and shows three production patterns (systemd slices, Kubernetes pod cgroups, and Podman) for baking cgroups v2 into deployment pipelines.

Unified Hierarchy Overview

cgroups v2 collapses the eight (or more) separate v1 hierarchies into a single virtual filesystem mounted at /sys/fs/cgroup. All enabled controllers operate side‑by‑side on the same tree, which eliminates cross‑hierarchy race conditions and simplifies bookkeeping.

What Changed from v1

| Aspect | cgroups v1 | cgroups v2 |
|---|---|---|
| Hierarchy | One mount per controller (e.g., cpu, memory) | Single mount, all enabled controllers share it |
| Controller activation | Implicit when a controller is mounted | Explicit via the cgroup.subtree_control file |
| Resource distribution | cpu.shares, cpu.cfs_quota_us, memory.limit_in_bytes per hierarchy | cpu.weight, cpu.max, memory.max per cgroup |
| Process placement | cgroup.procs file per hierarchy | Single cgroup.procs file per node in the unified tree |

The single mount means you can now reason about a workload’s resource envelope in one place, and you no longer need to remember which hierarchy a given controller lives under.
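One practical consequence is that a process has exactly one cgroup path. The sketch below reads it from a /proc/<pid>/cgroup file; on a pure v2 host that file contains a single line of the form 0::/path. The helper name cgroup_v2_path is our own invention for illustration, not a standard tool.

```shell
#!/bin/sh
# cgroup_v2_path (hypothetical helper): print the cgroup v2 path recorded
# in a /proc/<pid>/cgroup file. The v2 entry always has hierarchy ID 0 and
# an empty controller list, i.e. it starts with "0::".
cgroup_v2_path() {
    # $1: membership file to read (defaults to the current process)
    membership_file="${1:-/proc/self/cgroup}"
    sed -n 's/^0:://p' "$membership_file"
}

# Usage on a real host (not executed here):
#   cgroup_v2_path /proc/self/cgroup
#   ls "/sys/fs/cgroup$(cgroup_v2_path)"
```

On a hybrid host the file also lists v1 hierarchies; the helper ignores those lines and reports only the v2 entry.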

Mounting the Unified Hierarchy

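On most current distributions (for example Fedora 31+, Debian 11+, Ubuntu 21.10+, and RHEL 9), the unified hierarchy is enabled by default. The choice is made by the distribution's systemd configuration rather than by the kernel version alone, although a reasonably recent kernel is advisable for full controller coverage. If you need to mount it manually:

# Bash – mount the cgroup2 filesystem (controllers still bound to a v1 hierarchy will not be available in it)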
sudo mount -t cgroup2 none /sys/fs/cgroup

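On systemd-based hosts, systemd mounts /sys/fs/cgroup itself during early boot, so no fstab entry is needed. On systems without systemd, you can make the mount persistent across reboots by adding the following line to /etc/fstab: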

none    /sys/fs/cgroup    cgroup2    defaults    0   0

Note – The kernel parameter systemd.unified_cgroup_hierarchy=1 forces systemd to bootstrap the unified hierarchy early in the boot process. See the systemd docs for details.

Controllers Deep Dive

cgroups v2 ships with a core set of controllers that can be turned on or off per subtree. The most common ones for production workloads are cpu, memory, io, pids, and cpuset. Below we explore each controller’s syntax, typical usage patterns, and pitfalls.
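Because a controller can only be enabled for a subtree if the cgroup itself has it available, it helps to check cgroup.controllers before writing to cgroup.subtree_control. The following is a minimal sketch of that check; subtree_control_request is a hypothetical helper name, and it only builds the string to write rather than writing it.

```shell
#!/bin/sh
# subtree_control_request (hypothetical helper): build the "+ctrl ..." string
# for cgroup.subtree_control, keeping only controllers that the cgroup lists
# in its cgroup.controllers file. Unavailable ones are reported on stderr.
subtree_control_request() {
    # $1: cgroup directory; remaining arguments: wanted controller names
    cg_dir="$1"; shift
    available=" $(cat "$cg_dir/cgroup.controllers") "
    request=""
    for ctrl in "$@"; do
        case "$available" in
            *" $ctrl "*) request="$request +$ctrl" ;;
            *) echo "warning: controller '$ctrl' not available in $cg_dir" >&2 ;;
        esac
    done
    # Strip the leading space before printing.
    printf '%s\n' "${request# }"
}

# Usage on a real host (not executed here, requires root):
#   subtree_control_request /sys/fs/cgroup cpu memory io \
#       > /sys/fs/cgroup/cgroup.subtree_control
```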

CPU Controller

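The CPU controller in v2 keeps both models that v1 offered, bandwidth throttling and proportional sharing, under new file names. Two files are relevant:

  • cpu.max – defines the hard quota and period (replaces cpu.cfs_quota_us and cpu.cfs_period_us).
  • cpu.weight – a relative share of CPU time under contention, in the range 1 to 10000 (replaces cpu.shares). It applies whether or not a quota is set.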

Setting a hard quota

# Bash – limit a cgroup to 200 ms of CPU every 1 s (20% of a single core)
echo "200000 1000000" > /sys/fs/cgroup/myservice/cpu.max

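The values are in microseconds: quota first, then period. The period defaults to 100000 (100 ms), and writing max as the quota removes the limit. For a fractional number of cores, multiply the desired core count by the period and write the result as the quota. For example, 1.5 cores at the default period is 1.5 × 100000 = 150000.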
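The quota arithmetic is easy to get wrong by a factor of ten, so here is a small sketch that does it for you. cpu_max_for_cores is a hypothetical helper; it only prints the string you would write to cpu.max and assumes awk is available for the fractional math.

```shell
#!/bin/sh
# cpu_max_for_cores (hypothetical helper): print a "quota period" pair for
# cpu.max, given a core count that may be fractional.
cpu_max_for_cores() {
    # $1: number of cores allowed; $2: period in microseconds (default 100000)
    awk -v cores="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%d %d\n", cores * period + 0.5, period }'
}

# Usage on a real host (not executed here, requires root):
#   cpu_max_for_cores 1.5 > /sys/fs/cgroup/myservice/cpu.max
```

Calling cpu_max_for_cores 0.2 1000000 reproduces the 200 ms per second quota used in the example above.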

Using weight for best‑effort scheduling

# Bash – give this cgroup a weight of 200 (default is 100)
echo 200 > /sys/fs/cgroup/myservice/cpu.weight

Higher weight translates to more CPU time when the system is under contention, but it does not enforce a hard limit.
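To get a feel for what a weight means in practice, the sketch below computes the share of CPU time a cgroup would receive if it and all of its siblings were busy at the same time. weight_share is a hypothetical helper, and the result is only an approximation of scheduler behaviour under full contention; idle siblings give their time back to the others.

```shell
#!/bin/sh
# weight_share (hypothetical helper): expected CPU share, in percent, of a
# cgroup with the given weight when competing against busy siblings.
weight_share() {
    # $1: weight of the cgroup of interest; remaining arguments: sibling weights
    mine="$1"; shift
    awk -v mine="$mine" 'BEGIN {
        total = mine
        for (i = 1; i < ARGC; i++) total += ARGV[i]
        printf "%.1f\n", 100 * mine / total
    }' "$@"
}

# Example: weight 200 against two siblings at the default weight of 100.
#   weight_share 200 100 100
```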

Memory Controller

Memory isolation is expressed with two files:

  • memory.max – the absolute limit in bytes.
  • memory.high – a soft threshold that triggers reclamation but does not kill the cgroup.

Hard limit example

# Bash – cap the cgroup at 2 GiB
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/myservice/memory.max

Soft limit and OOM handling

# Bash – set a high watermark at 1.5 GiB
echo $((1536 * 1024 * 1024)) > /sys/fs/cgroup/myservice/memory.high

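When usage exceeds memory.high, the kernel throttles the cgroup's allocations and reclaims its pages aggressively, ideally before usage reaches memory.max. If usage hits memory.max and cannot be reduced, the OOM killer runs inside the cgroup and kills one of its processes (or all of them if memory.oom.group is set to 1).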
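The kernel records how often each threshold was hit in the cgroup's memory.events file, which is the quickest way to tell throttling apart from OOM kills. A minimal sketch for reading one counter follows; mem_event_count is a hypothetical helper name.

```shell
#!/bin/sh
# mem_event_count (hypothetical helper): print one counter from a
# memory.events file. Typical keys are low, high, max, oom and oom_kill.
mem_event_count() {
    # $1: path to a memory.events file; $2: event name
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Usage on a real host (not executed here):
#   mem_event_count /sys/fs/cgroup/myservice/memory.events oom_kill
```

A growing high counter with oom_kill at zero suggests the soft threshold is doing its job; a growing oom_kill counter means the hard limit is being reached.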

Note – Writing the string max to memory.max removes the limit. You can write a finite value again at any time without remounting; if the new limit is below current usage, the kernel reclaims memory from the cgroup and may invoke the OOM killer.

I/O Controller (io)

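The I/O controller sets per‑device limits through the io.max file. Each line names a device and accepts bandwidth limits in bytes per second (rbps, wbps) and IOPS limits (riops, wiops). Proportional weighting is configured separately through io.weight.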

Limiting a block device

# Bash – limit /dev/sda to 10 MiB/s reads and 5 MiB/s writes
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/myservice/io.max

8:0 is the major:minor number for /dev/sda. You can discover it with lsblk -dno MAJ:MIN /dev/sda.
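Since io.max takes raw bytes per second, a small formatter keeps the unit conversion in one place. The sketch below is illustrative only; io_max_line is a hypothetical helper that prints the line rather than writing it.

```shell
#!/bin/sh
# io_max_line (hypothetical helper): format an io.max entry from a device
# number and read/write bandwidth limits given in MiB/s.
io_max_line() {
    # $1: device as MAJ:MIN; $2: read limit in MiB/s; $3: write limit in MiB/s
    printf '%s rbps=%d wbps=%d\n' "$1" "$(( $2 * 1048576 ))" "$(( $3 * 1048576 ))"
}

# Usage on a real host (not executed here, requires root):
#   dev=$(lsblk -dno MAJ:MIN /dev/sda | tr -d ' ')
#   io_max_line "$dev" 10 5 > /sys/fs/cgroup/myservice/io.max
```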

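Combining IOPS and bandwidth

# Bash – 10 MiB/s reads plus 1000 read IOPS and 500 write IOPS on /dev/nvme0n1
# (259:0 is a typical number for the first NVMe namespace; confirm it with lsblk)
echo "259:0 rbps=10485760 riops=1000 wiops=500" > /sys/fs/cgroup/myservice/io.max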

PIDs Controller

The pids controller caps the number of processes (including threads) that can be spawned in a cgroup. This is a safety net against fork bombs.

# Bash – allow at most 200 processes
echo 200 > /sys/fs/cgroup/myservice/pids.max

When the limit is reached, fork() returns EAGAIN, and the offending workload typically logs an error. This is especially useful for untrusted code execution environments.
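To see how close a workload is to its ceiling before fork() starts failing, compare pids.current with pids.max. A minimal sketch, with pids_headroom as a hypothetical helper name:

```shell
#!/bin/sh
# pids_headroom (hypothetical helper): print how many more tasks a cgroup
# may create, or "unlimited" when pids.max holds the string "max".
pids_headroom() {
    # $1: cgroup directory
    pid_limit=$(cat "$1/pids.max")
    pid_count=$(cat "$1/pids.current")
    if [ "$pid_limit" = "max" ]; then
        echo "unlimited"
    else
        echo "$(( pid_limit - pid_count ))"
    fi
}

# Usage on a real host (not executed here):
#   pids_headroom /sys/fs/cgroup/myservice
```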

Cpuset Controller

cpuset pins a cgroup to a specific set of CPUs and memory nodes. It requires two files:

  • cpuset.cpus – list of logical CPUs (e.g., 0-3,8-11).
  • cpuset.mems – list of NUMA memory nodes.

Example: isolate a service to a dedicated CPU socket

# Bash – bind to CPUs 0‑7 (first socket) and NUMA node 0
echo "0-7" > /sys/fs/cgroup/myservice/cpuset.cpus
echo "0"   > /sys/fs/cgroup/myservice/cpuset.mems

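The cpuset.* files exist in a cgroup only if cpuset is enabled in its parent's cgroup.subtree_control; until you enable it there, the writes above fail. The CPUs and memory nodes you request must also be available to the parent, because a child's effective set is always a subset of its parent's.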
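The list syntax used by cpuset.cpus mixes single CPUs and ranges, so it is handy to count how many CPUs a given list actually covers, for instance to check that a cpu.max quota does not exceed the pinned cores. The sketch below assumes well-formed input; cpu_list_count is a hypothetical helper.

```shell
#!/bin/sh
# cpu_list_count (hypothetical helper): count the CPUs named by a cpuset
# list such as "0-3,8-11" or "2". Input is assumed to be well formed.
cpu_list_count() {
    # $1: CPU list in cpuset syntax
    printf '%s\n' "$1" | awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if (split($i, r, "-") == 2) n += r[2] - r[1] + 1
            else n += 1
        }
        print n
    }'
}

# Usage on a real host (not executed here):
#   cpu_list_count "$(cat /sys/fs/cgroup/myservice/cpuset.cpus.effective)"
```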

Architecture Patterns in Production

cgroups v2 is not a standalone toy; it is the backbone of container runtimes, systemd slices, and Kubernetes pod isolation. Below we outline three proven patterns.

1. Systemd Slice per Service

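When a unit sets Slice=myservice.slice, systemd creates that slice on demand and maps it to a dedicated cgroup subtree. Resource directives such as CPUQuota=20% and MemoryMax=2G belong in the [Service] section of the service unit (or in the [Slice] section of a separate myservice.slice unit, if the limit should cover everything placed in the slice). systemd translates them into the corresponding cpu.max and memory.max values.

# /etc/systemd/system/myservice.service
[Unit]
Description=My high-throughput API

[Service]
ExecStart=/usr/local/bin/myservice
Slice=myservice.slice
Restart=on-failure
# Resource-control directives go in [Service]; a .service file has no [Slice] section
CPUQuota=20%
MemoryMax=2G
IOWeight=500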

When the unit starts, systemd writes the values into the unified hierarchy, ensuring consistent enforcement across reboots.

2. Kubernetes Pod Cgroup Integration

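Kubernetes has supported cgroup v2 as a stable feature since v1.25. The kubelet detects the unified hierarchy automatically, and the systemd cgroup driver is the recommended choice on v2 hosts. With that driver, pod cgroups live under /sys/fs/cgroup/kubepods.slice, in a QoS‑specific child such as kubepods-burstable.slice or kubepods-besteffort.slice (Guaranteed pods sit directly under kubepods.slice). CPU and memory limits in the pod spec become cpu.max and memory.max, while CPU requests become cpu.weight. The pod below is Burstable because its requests are lower than its limits.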

apiVersion: v1
kind: Pod
metadata:
  name: data-worker
spec:
  containers:
  - name: worker
    image: myorg/worker:latest
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "2Gi"

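Kubernetes does not expose block I/O limits in the core pod spec, so io.max is not set from a manifest like this one. If you need I/O throttling for pods, it has to be applied out of band, for example by a node‑level agent or runtime‑specific configuration.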
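To make the pod-spec mapping concrete, here is a rough sketch of the arithmetic for CPU. It assumes the default 100 ms period for limits, and for requests it uses the linear shares-to-weight conversion that container runtimes have historically applied; some newer runtime versions use a different curve, so treat the weight as illustrative. Both function names are hypothetical.

```shell
#!/bin/sh
# k8s_cpu_max (hypothetical helper): cpu.max value for a CPU limit given in
# millicores, assuming the default 100000 microsecond period.
k8s_cpu_max() {
    # $1: CPU limit in millicores (e.g. 1000 for "1")
    echo "$(( $1 * 100000 / 1000 )) 100000"
}

# k8s_cpu_weight (hypothetical helper): cpu.weight for a CPU request given in
# millicores. Assumption: request -> v1-style shares (1024 per core, minimum 2)
# -> weight via the linear mapping 1 + (shares - 2) * 9999 / 262142.
k8s_cpu_weight() {
    # $1: CPU request in millicores (e.g. 500 for "500m")
    shares=$(( $1 * 1024 / 1000 ))
    if [ "$shares" -lt 2 ]; then
        shares=2
    fi
    echo "$(( 1 + (shares - 2) * 9999 / 262142 ))"
}

# For the pod above: k8s_cpu_max 1000 and k8s_cpu_weight 500
```

The memory limit needs no conversion: 2Gi is written to memory.max as 2147483648 bytes.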

3. Edge‑Node Isolation with Podman

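Podman (v4+) ships with native cgroup v2 support. With the cgroupfs manager (the global option podman --cgroup-manager=cgroupfs), Podman creates container cgroups under its configured cgroup parent instead of a systemd slice; you can confirm the exact location for a running container by reading /proc/<pid>/cgroup. You can pass fine‑grained limits directly: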

podman run -d \
  --name analytics \
  --cpu-quota=200000 --cpu-period=1000000 \
  --memory=2g \
  --pids-limit=150 \
  myorg/analytics:latest

Behind the scenes, Podman writes the same values we demonstrated earlier, giving you CLI‑level parity with systemd slices.

Configuring cgroups v2 in Real Deployments

Below is a step‑by‑step checklist that production teams can embed into their CI/CD pipelines.

  1. Detect unified hierarchy

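    # The filesystem type of /sys/fs/cgroup is cgroup2fs only in pure v2 mode;
    # a hybrid layout also lists cgroup2 in /proc/mounts, so grep is not enough.
    if [ "$(stat -fc %T /sys/fs/cgroup)" = "cgroup2fs" ]; then
        echo "Unified hierarchy present"
    else
        echo "cgroup v1 or hybrid layout detected: abort"
        exit 1
    fi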
    
  2. Enable required controllers on the root cgroup (only once).

    echo "+cpu +memory +io +pids +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
    
  3. Create a dedicated subtree for the service.

    mkdir /sys/fs/cgroup/myservice
    
  4. Assign processes – echo the PID(s) into cgroup.procs.

    echo $$ > /sys/fs/cgroup/myservice/cgroup.procs   # current shell
    
  5. Apply limits – write to the appropriate controller files (see sections above).

    echo "200000 1000000" > /sys/fs/cgroup/myservice/cpu.max
    echo $((2*1024*1024*1024)) > /sys/fs/cgroup/myservice/memory.max
    echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/myservice/io.max
    echo 200 > /sys/fs/cgroup/myservice/pids.max
    echo "0-3" > /sys/fs/cgroup/myservice/cpuset.cpus
    echo "0"   > /sys/fs/cgroup/myservice/cpuset.mems
    
  6. Validate – read back the files and compare with expectations.

    cat /sys/fs/cgroup/myservice/cpu.max
    cat /sys/fs/cgroup/myservice/memory.max
    
  7. Monitor – use systemd-cgtop, or read the cgroup's own files (cpu.stat, memory.current, memory.events, io.stat), to watch live usage.

    systemd-cgtop -n 5
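Steps 5 and 6 can be folded into one helper that writes each value and immediately reads it back, which is one way to wire the checklist into a pipeline. The sketch below takes the cgroup directory as a parameter so it can be dry-run against an ordinary directory; set_and_verify and apply_limits are hypothetical names. One caveat is that on a real host the kernel may normalise what you wrote (for example rounding memory.max to a page boundary), so a mismatch deserves a look rather than an automatic failure.

```shell
#!/bin/sh
# set_and_verify (hypothetical helper): write a value to a control file and
# confirm that reading it back returns the same string.
set_and_verify() {
    # $1: file to write; $2: value
    printf '%s\n' "$2" > "$1" || return 1
    actual=$(cat "$1")
    if [ "$actual" != "$2" ]; then
        echo "mismatch in $1: wrote '$2', read '$actual'" >&2
        return 1
    fi
}

# apply_limits (hypothetical helper): apply CPU, memory and PID limits to one
# cgroup directory, stopping at the first failure.
apply_limits() {
    # $1: cgroup directory; $2: cpu.max; $3: memory.max; $4: pids.max
    set_and_verify "$1/cpu.max"    "$2" &&
    set_and_verify "$1/memory.max" "$3" &&
    set_and_verify "$1/pids.max"   "$4"
}

# Usage on a real host (not executed here, requires root):
#   apply_limits /sys/fs/cgroup/myservice "200000 1000000" 2147483648 200
```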
    

Automating with Ansible

A minimal Ansible role can enforce the same configuration across a fleet:

# tasks/main.yml
# cgroupfs files must be written in place, so these tasks use shell redirection
# instead of the copy module, which writes a temporary file and renames it.
- name: Ensure unified hierarchy is mounted
  mount:
    path: /sys/fs/cgroup
    src: none
    fstype: cgroup2
    state: mounted

- name: Enable controllers on root
  shell: echo "+cpu +memory +io +pids +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
  changed_when: false

- name: Create service cgroup
  file:
    path: /sys/fs/cgroup/{{ service_name }}
    state: directory

- name: Apply resource limits
  shell: echo "{{ item.value }}" > "/sys/fs/cgroup/{{ service_name }}/{{ item.file }}"
  changed_when: false
  loop:
    - { file: cpu.max, value: "{{ cpu_quota }} {{ cpu_period }}" }
    - { file: memory.max, value: "{{ memory_limit }}" }
    - { file: io.max, value: "{{ io_limits }}" }
    - { file: pids.max, value: "{{ pids_limit }}" }
    - { file: cpuset.cpus, value: "{{ cpus }}" }
    - { file: cpuset.mems, value: "{{ mems }}" }

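Because cgroupfs contents do not survive a reboot, run the role at boot as well as at deploy time so that the limits are reapplied. On hosts where systemd or the kubelet manages the tree, prefer their declarative settings over writing to cgroupfs directly.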

Common Pitfalls and Debugging

Even seasoned engineers stumble over a few recurring issues.

| Symptom | Likely cause | Fix |
|---|---|---|
| Writes to cpuset.cpus fail, or the file is missing in the child cgroup | Parent cgroup does not have cpuset enabled, so the child has no cpuset.* files | Enable +cpuset in the parent's cgroup.subtree_control first |
| Process escapes the limit after a reboot | Limits were set on a hand‑made cgroup; cgroupfs state is not persistent | Persist configuration via a systemd unit, the kubelet, or a startup script |
| OOM kills despite memory.high being set | memory.high only throttles and reclaims; usage still reached memory.max, or the host itself ran out of memory | Check memory.events for max and oom_kill counts, then raise memory.max or widen the gap between memory.high and memory.max |
| io.max appears to have no effect | Commonly the io controller is not enabled for the subtree, the limit is keyed to the wrong MAJ:MIN, or the traffic is buffered writeback that is not attributed to the cgroup | Verify io in the parent's cgroup.subtree_control, confirm the device number with lsblk, and test with direct I/O |
| systemd-cgtop shows 0% or no CPU figure for a throttled container | cgtop derives rates from successive samples, so a single iteration may show nothing useful | Run several iterations and read cpu.stat (nr_throttled, throttled_usec) directly |

When in doubt, consult the kernel documentation for the controller you’re tweaking. The official cgroup v2 reference is an excellent source: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html.
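For CPU throttling in particular, cpu.stat gives a direct answer. The sketch below computes the percentage of enforcement periods in which the cgroup was throttled; throttle_percent is a hypothetical helper, and the nr_periods and nr_throttled fields appear only when the cpu controller is enabled for the cgroup.

```shell
#!/bin/sh
# throttle_percent (hypothetical helper): percentage of CPU enforcement
# periods in which the cgroup was throttled, read from a cpu.stat file.
# Prints "n/a" when no periods have been recorded.
throttle_percent() {
    # $1: path to a cpu.stat file
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "n/a"
        }' "$1"
}

# Usage on a real host (not executed here):
#   throttle_percent /sys/fs/cgroup/myservice/cpu.stat
```

A persistently high percentage usually means the quota in cpu.max is too tight for the workload's bursts.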

Key Takeaways

  • cgroups v2’s unified hierarchy eliminates the fragmentation of v1, letting you manage all resources from a single tree.
  • Controllers (cpu, memory, io, pids, cpuset) use simple flat files (*.max, *.weight, etc.) that can be scripted or managed by systemd/Kubernetes.
  • Production‑grade isolation patterns include systemd slices, Kubernetes pod cgroups, and Podman subtrees.
  • Always enable controllers in the parent's cgroup.subtree_control before writing to child cgroups; otherwise the controller's files will not exist in the child and the writes will fail.
  • Persist limits via declarative mechanisms (systemd unit files, kubelet specs, Ansible roles) to survive reboots and node churn.
  • Monitor with systemd-cgtop or specialized tools; validate limits after each deployment to catch regressions early.

Further Reading