TL;DR: cgroups v2 replaces the fragmented v1 trees with a single, unified hierarchy, making resource isolation predictable. By booting into the unified hierarchy, creating cgroups under the cgroup2 filesystem, and applying per‑controller limits (cpu.max, memory.max, io.max), you can achieve production‑grade performance isolation for containers, VMs, or bare‑metal services.

Resource isolation is the backbone of modern cloud‑native workloads. While most engineers interact with cgroups indirectly through Docker, Kubernetes, or systemd, a solid grasp of cgroups v2 lets you troubleshoot noisy neighbors, fine‑tune latency‑sensitive services, and avoid costly over‑provisioning. This post walks you through the architecture, shows how to enable and use the unified hierarchy on a real host, and provides a checklist of performance‑tuning patterns you can drop into any production pipeline.

Why cgroups v2 Matters

  • Unified hierarchy – All controllers live under a single mount point (/sys/fs/cgroup), eliminating the “multiple trees” problem of v1 where CPU, memory, and blkio could be on different branches.
  • Simplified semantics – Controllers expose a consistent set of files (max, low, high, stat) that are easier to script.
  • Predictable placement – v2’s “no internal processes” rule keeps processes in leaf cgroups once a parent hands controllers down to its children, so limits apply exactly where you set them instead of competing with processes parked in intermediate nodes.
  • Future‑proof – New controllers (e.g., pids, rdma) are added without breaking existing tooling, and most current distributions now boot into v2 by default.

Because newer kernel features such as memory.high and per‑cgroup pressure stall information exist only in v2, organizations that want predictable performance at scale should treat the move to v2 as a question of when, not if.

Architecture of the Unified Hierarchy

Controllers Overview

Controller  Primary resource       Typical file(s)          Example limit
cpu         CPU time / bandwidth   cpu.max, cpu.weight      cpu.max="50000 100000" (50 % of a single core)
memory      RAM usage              memory.max, memory.low   memory.max="2G"
io          Block I/O throttling   io.max                   io.max="8:0 rbps=10485760 wbps=5242880" (10 MiB/s read, 5 MiB/s write)
pids        Process count          pids.max                 pids.max=100
hugetlb     Huge page allocation   hugetlb.<size>.max       hugetlb.2MB.max="1G"

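Each controller exposes its state as flat interface files inside the cgroup directory. These are not regular files: cgroup2 is a pseudo‑filesystem, so a write goes straight into the kernel's in‑memory state for that cgroup and takes effect immediately, with no polling step in between.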

Filesystem Layout

When you mount the unified hierarchy, the tree looks like:

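/sys/fs/cgroup/
├─ cgroup.controllers
├─ cgroup.subtree_control
├─ cgroup.procs
├─ cpu.stat
└─ myservice/
   ├─ cgroup.procs
   ├─ cpu.max
   ├─ memory.max
   └─ io.max

The root cgroup is not subject to resource control, so it carries no cpu.max or memory.max of its own. Limits are hierarchical rather than inherited defaults: a new child starts out unlimited ("max"), but it can never use more than any of its ancestors allow. This top‑down model is also how systemd slices work, which is why many production teams prefer systemd‑unit‑based cgroup creation.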
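
To see which limit actually binds for a given cgroup, one option is to walk from the cgroup up to the mount point and take the tightest memory.max found along the way. The sketch below assumes the standard /sys/fs/cgroup mount; the function name effective_memory_max is made up for illustration.

```python
from pathlib import Path


def effective_memory_max(cgroup: Path, root: Path = Path("/sys/fs/cgroup")) -> float:
    """Return the tightest memory.max between `cgroup` and `root`, in bytes.

    The literal "max" means "no limit" and is treated as infinity.
    Directories without a memory.max file (such as the root) are skipped.
    """
    limit = float("inf")
    node = cgroup
    while True:
        limit_file = node / "memory.max"
        if limit_file.exists():
            raw = limit_file.read_text().strip()
            if raw != "max":
                limit = min(limit, int(raw))
        # Stop at the cgroup mount point (or the filesystem root as a safety net).
        if node == root or node == node.parent:
            break
        node = node.parent
    return limit
```

Calling effective_memory_max(Path("/sys/fs/cgroup/myservice")) returns a byte count, or infinity when no ancestor sets a limit.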

Implementing cgroups v2 in Production

Enabling v2 on Modern Distros

Most recent Ubuntu, Fedora, and RHEL releases ship with v2 enabled by default. Verify with:

$ stat -fc %T /sys/fs/cgroup
cgroup2

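If you see tmpfs instead, the host is running v1 (or systemd's hybrid mode). On systemd‑based distros, add the kernel boot parameter systemd.unified_cgroup_hierarchy=1 (optionally with cgroup_no_v1=all so that no controller stays bound to v1), regenerate the GRUB configuration, and reboot:

# sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"/' /etc/default/grub
# update-grub        # Debian/Ubuntu; on Fedora/RHEL: grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot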

After reboot, re‑run the stat command to confirm the switch.
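
The same check can be scripted, for example in a provisioning step. This is a minimal sketch that relies on the cgroup.controllers file being present at the root of a cgroup2 mount, and on systemd's hybrid mode placing its v2 tree under a unified/ subdirectory; the function name and return strings are illustrative.

```python
from pathlib import Path


def cgroup_mode(mount: Path = Path("/sys/fs/cgroup")) -> str:
    """Classify the cgroup setup found at `mount`."""
    # A cgroup2 mount always has cgroup.controllers in its root directory.
    if (mount / "cgroup.controllers").is_file():
        return "unified"
    # Hybrid mode: v1 controllers plus a v2 tree mounted at <mount>/unified.
    if (mount / "unified" / "cgroup.controllers").is_file():
        return "hybrid"
    return "legacy-or-unmounted"
```

A deployment script could refuse to continue unless cgroup_mode() returns "unified".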

Creating and Managing Groups Manually

Suppose you want to isolate a CPU‑bound batch job called image-processor. The steps are:

# mkdir -p /sys/fs/cgroup/image-processor
# echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control
# echo "50000 100000" > /sys/fs/cgroup/image-processor/cpu.max   # 50 % of one core
# echo "2G" > /sys/fs/cgroup/image-processor/memory.max
# echo "8:0 rbps=20971520 wbps=10485760" > /sys/fs/cgroup/image-processor/io.max   # 20 MiB/s read, 10 MiB/s write, as plain byte counts
# echo $$ > /sys/fs/cgroup/image-processor/cgroup.procs   # move current shell into the group

Key points:

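  • Controllers are enabled in the parent: writing +cpu +memory +io to the parent's cgroup.subtree_control makes the matching cpu.*, memory.*, and io.* files appear in its children. You do this once per parent, not once per child.
  • cpu.max takes "quota period" in microseconds. Writing a single number updates only the quota and leaves the period at its current value (100 ms by default); writing max as the quota removes the cap.
  • Writing a PID to cgroup.procs migrates that process, with all of its threads, into the group, and the limits apply from that moment on.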
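
The manual steps above can be wrapped in a small helper. This is a sketch only: it assumes root privileges on a real host, the helper names create_cgroup and attach are invented here, and the parent directory is a parameter so the logic can be exercised against a scratch directory first.

```python
from pathlib import Path
from typing import Dict, Iterable


def create_cgroup(parent: Path, name: str, limits: Dict[str, str],
                  controllers: Iterable[str] = ("cpu", "memory", "io")) -> Path:
    """Create parent/name, enable controllers for the parent's children, write limits."""
    group = parent / name
    group.mkdir(exist_ok=True)
    # Enabling controllers in the parent makes the controller files appear in the child.
    enable = " ".join(f"+{c}" for c in controllers)
    (parent / "cgroup.subtree_control").write_text(enable)
    # Each limit is a plain write to the matching interface file.
    for filename, value in limits.items():
        (group / filename).write_text(value)
    return group


def attach(group: Path, pid: int) -> None:
    """Move process `pid` (and all its threads) into `group`."""
    (group / "cgroup.procs").write_text(str(pid))
```

For the image-processor example this would be create_cgroup(Path("/sys/fs/cgroup"), "image-processor", {"cpu.max": "50000 100000", "memory.max": "2G"}) followed by attach(group, os.getpid()).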

Systemd Integration (Production‑Ready)

Most production services run as systemd units. Systemd abstracts the low‑level file writes with unit directives:

# /etc/systemd/system/image-processor.service
[Unit]
Description=Image Processor Batch Job

[Service]
ExecStart=/usr/local/bin/image-processor
CPUQuota=50%
MemoryMax=2G
IOReadBandwidthMax=/dev/sda 20M
IOWriteBandwidthMax=/dev/sda 10M
# Optional: cap the number of tasks (processes and threads)
TasksMax=100

[Install]
WantedBy=multi-user.target

After systemctl daemon-reload and systemctl start image-processor.service, the service's cgroup appears under /sys/fs/cgroup/system.slice/image-processor.service/ with the same interface files as the manual approach; systemd creates it when the unit starts and removes it when the unit stops.
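
To confirm where a running process ended up, you can read /proc/<pid>/cgroup, where the v2 entry has hierarchy ID 0 and an empty controller list. The following sketch parses that format; the function names are illustrative and the mount point is assumed to be /sys/fs/cgroup.

```python
from pathlib import Path
from typing import Optional


def unified_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from the text of /proc/<pid>/cgroup, if any."""
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        # The v2 entry looks like "0::/system.slice/foo.service".
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def cgroup_dir_for_pid(pid: int, mount: Path = Path("/sys/fs/cgroup")) -> Optional[Path]:
    """Map a PID to its cgroup directory under `mount`."""
    path = unified_cgroup_path(Path(f"/proc/{pid}/cgroup").read_text())
    return None if path is None else mount / path.lstrip("/")
```

For the unit above, the parsed path would be expected to read /system.slice/image-processor.service.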

Performance Tuning Patterns

CPU Bandwidth & Max

  • cpu.max vs. the v1 quota – cpu.max replaces the v1 pair cpu.cfs_quota_us and cpu.cfs_period_us with a single "quota period" value in microseconds. Use it to enforce hard caps on bursty workloads.
  • Weight‑based sharing – When you need relative fairness rather than absolute caps, set cpu.weight (range 1–10000, default 100). Weights only matter under contention: each cgroup gets CPU time in proportion to its weight relative to its busy siblings, so a weight of 100 next to nine busy siblings of weight 100 yields roughly 10 %.
# Example: deprioritize a service to one tenth of the default weight
echo "10" > /sys/fs/cgroup/lowprio/cpu.weight
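
The arithmetic behind both knobs is simple enough to sketch. The helpers below (names invented for illustration) format a cpu.max value from a number of cores, assuming the default 100 ms period, and compute the share each sibling would get from its cpu.weight when all siblings are busy.

```python
from typing import Dict


def cpu_max_line(cores: float, period_usec: int = 100_000) -> str:
    """Format a cpu.max value that grants `cores` CPUs' worth of time per period."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return f"{round(cores * period_usec)} {period_usec}"


def weight_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of contended CPU time per sibling, proportional to cpu.weight."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

For example, cpu_max_line(0.5) yields "50000 100000", and weight_shares({"web": 100, "batch": 100, "lowprio": 10}) gives lowprio about 4.8 % of the CPU while the other two are busy.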

Memory High/Low

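  • memory.max – Hard limit; if usage reaches it and reclaim cannot free enough memory, the OOM killer is invoked inside the cgroup.
  • memory.high – Throttle threshold; above it the cgroup's processes are slowed down and put under heavy reclaim pressure, but the OOM killer is not invoked.
  • memory.low – Best‑effort protection; the kernel avoids reclaiming this amount from the cgroup while unprotected memory is available elsewhere, which is useful for latency‑critical caches.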
# Reserve 256 MiB for a latency‑critical cache, but allow up to 1 GiB total
echo "256M" > /sys/fs/cgroup/cache/memory.low
echo "1G"   > /sys/fs/cgroup/cache/memory.max

Monitoring memory.stat gives insight into inactive_file, active_anon, etc., enabling data‑driven adjustments.
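
memory.stat is a flat "key value" file, so it is easy to turn into a dictionary and derive ratios from. The sketch below assumes the anon and file keys that cgroup v2 reports; the function names are illustrative.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse 'key value' lines (memory.stat, cpu.stat, ...) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def page_cache_fraction(stats: Dict[str, int]) -> float:
    """Share of anon + file memory that is page cache (the 'file' counter)."""
    total = stats["anon"] + stats["file"]
    return stats["file"] / total if total else 0.0
```

A cache‑heavy service with a high page_cache_fraction is usually a better candidate for memory.low protection than one dominated by anonymous memory.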

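I/O Throttling with io.max

The io.max file sets per‑device bandwidth and IOPS ceilings (the v2 controller is named io; blkio was its v1 predecessor). Syntax: <major:minor> <key>=<value> ..., where <key> is rbps, wbps, riops, or wiops and <value> is a plain number (bytes per second or I/Os per second) or max to lift the limit.

# Limit a database to 50 MiB/s reads and 30 MiB/s writes on /dev/nvme0n1
echo "259:0 rbps=52428800 wbps=31457280" > /sys/fs/cgroup/db/io.max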

Tip: Use lsblk -d -o MAJ:MIN,NAME to map block devices to major/minor numbers.
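
Because the kernel documentation writes io.max values as plain byte counts, it can help to generate the line rather than type it by hand. This is a minimal sketch: device_numbers and io_max_line are hypothetical helper names, and the lookup uses os.stat on the device node instead of lsblk.

```python
import os

MIB = 1024 * 1024
_ALLOWED_KEYS = {"rbps", "wbps", "riops", "wiops"}


def device_numbers(device_node: str) -> str:
    """Return 'major:minor' for a block device node such as /dev/sda."""
    rdev = os.stat(device_node).st_rdev
    return f"{os.major(rdev)}:{os.minor(rdev)}"


def io_max_line(dev: str, **limits) -> str:
    """Build an io.max line; values are bytes/s, I/Os per second, or 'max'."""
    unknown = set(limits) - _ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported io.max keys: {sorted(unknown)}")
    return " ".join([dev] + [f"{key}={value}" for key, value in limits.items()])
```

With these, io_max_line(device_numbers("/dev/nvme0n1"), rbps=50 * MIB, wbps=30 * MIB) produces the line to write into the db cgroup's io.max.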

PID & Process Count

For services that spawn many short‑lived workers (e.g., web servers), pids.max protects the host from fork‑bomb style exhaustion.

echo "200" > /sys/fs/cgroup/web/pids.max

When the limit is reached, fork() and clone() fail with EAGAIN, which many language runtimes surface as an exception you can catch and log.
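
Rather than waiting for fork() to fail, a service can check how close it is to the limit by comparing pids.current with pids.max. The helper below is a sketch with an invented name; it takes the cgroup directory as a parameter and treats the literal max as "unlimited".

```python
from pathlib import Path
from typing import Optional


def pids_headroom(group: Path) -> Optional[int]:
    """Return how many more tasks the cgroup may create, or None if unlimited."""
    limit = (group / "pids.max").read_text().strip()
    if limit == "max":
        return None
    current = int((group / "pids.current").read_text())
    # Clamp at zero: current can exceed a limit that was lowered after the fact.
    return max(int(limit) - current, 0)
```

A worker pool could call pids_headroom(Path("/sys/fs/cgroup/web")) before scaling up and log a warning when the result gets small.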

Monitoring and Observability

A production‑grade stack typically scrapes cgroup metrics with Prometheus. node_exporter reports host‑level metrics rather than per‑cgroup usage, so per‑cgroup series usually come from cAdvisor, which exposes, among others:

  • container_cpu_usage_seconds_total
  • container_memory_usage_bytes
  • container_fs_writes_bytes_total

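You can also collect custom metrics with a tiny Python helper:

#!/usr/bin/env python3
"""Print CPU and memory usage of one cgroup every five seconds."""
import pathlib
import time

CGROUP = pathlib.Path("/sys/fs/cgroup/image-processor")

def read_metric(file):
    """Return the stripped contents of a cgroup interface file."""
    return (CGROUP / file).read_text().strip()

def read_keyed(file):
    """Parse a 'key value' per-line file such as cpu.stat into a dict."""
    return dict(line.split() for line in read_metric(file).splitlines())

while True:
    cpu_usec = int(read_keyed("cpu.stat")["usage_usec"])  # cumulative CPU time
    mem_bytes = int(read_metric("memory.current"))        # current usage in bytes
    print(f"cpu_usage_usec={cpu_usec} mem_bytes={mem_bytes}")
    time.sleep(5)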

Integrate the output into Grafana dashboards, set alerts on memory.current > memory.max * 0.9, and you have a proactive isolation guardrail.
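
One way to get such values into Prometheus is node_exporter's textfile collector, which reads *.prom files from a directory you configure. The sketch below renders values in the Prometheus text exposition format and writes them atomically; the metric names, the cgroup label, and the function names are all made up for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def render_metrics(cgroup_name: str, values: Dict[str, int]) -> str:
    """Render metric values as Prometheus text exposition lines."""
    lines = [
        f'{metric}{{cgroup="{cgroup_name}"}} {value}'
        for metric, value in sorted(values.items())
    ]
    return "\n".join(lines) + "\n"


def write_atomically(target: Path, text: str) -> None:
    """Write via temp file + rename so a scraper never sees a partial file."""
    fd, tmp = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp, target)
```

The 90 % rule then becomes an ordinary alert expression over the exported current and limit values, as long as the limit is exported as a number rather than the literal max.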

Key Takeaways

  • Unified hierarchy removes cross‑controller inconsistencies; mount /sys/fs/cgroup once and enable needed controllers via cgroup.subtree_control.
  • Controller files are the API – use cpu.max, memory.max, io.max, etc., to impose hard caps; cpu.weight and memory.low provide soft guarantees.
  • Systemd is the production‑grade orchestrator – embed limits in unit files to guarantee cleanup and avoid manual cgroup.procs fiddling.
  • Tune per‑service – start with modest caps, monitor *.stat files, and iteratively tighten limits based on observed burst patterns.
  • Observability matters – export cgroup metrics to Prometheus/Grafana and set alerts before a limit triggers OOM or throttling.

Further Reading