TL;DR — cgroups v2 gives you fine‑grained, hierarchical resource controls; by structuring controllers, using a unified hierarchy, and wiring them into systemd/Kubernetes you can guarantee performance SLAs and avoid noisy‑neighbor issues in production.
Resource isolation is no longer a nice‑to‑have feature; it is a production prerequisite. Whether you run a fleet of micro‑services on Kubernetes, a batch processing cluster on bare metal, or a container‑heavy CI pipeline, uncontrolled CPU, memory, and I/O can cascade into latency spikes and costly outages. The Linux kernel’s second‑generation control groups (cgroups v2) provide a single, unified hierarchy that simplifies the mental model while delivering more precise throttling, accounting, and delegation. This post walks through the architecture, shows how to embed cgroups v2 into systemd and Kubernetes, and equips you with concrete patterns and monitoring tricks that have proved reliable at scale.
Why cgroups v2 Matters for Production
Unified hierarchy eliminates fragmentation
cgroups v1 attached each controller (e.g., cpu, memory, blkio) to its own hierarchy, usually mounted as a separate filesystem. In a large cluster this produced a tangled forest of mount points, making it easy to misconfigure delegation or forget a controller entirely. cgroups v2 collapses all controllers under a single mount point (/sys/fs/cgroup), so every process sits in exactly one place in one tree, and a controller applies to a cgroup only when its parent enables it in cgroup.subtree_control. This uniformity reduces configuration drift, a common source of production incidents.
Stronger guarantees through “thread‑aware” accounting
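cgroups v2 can account for resources at thread granularity through threaded cgroups: writing threaded to cgroup.type lets threads of one process be placed in different sibling cgroups for thread-capable controllers such as cpu. This enables policies such as "the worker-thread pool can consume no more than 30 % of the group's CPU", which is especially valuable for Java or Go workloads that spawn many worker threads. Memory is not thread-aware, because the threads of a process share one address space, so memory limits and memory pressure stay at process granularity.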
Built‑in delegation for containers
cgroups v2 introduces the cgroup.subtree_control file, allowing a parent to hand down a subset of controllers to children without exposing the entire hierarchy. Containers launched by Docker, Podman, or CRI‑O can be given only the controllers they need, tightening the attack surface and simplifying security audits.
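To make the delegation step concrete, here is a minimal sketch of how a parent could compute the string it writes to cgroup.subtree_control. The function name subtree_control_delta and its arguments are hypothetical; the inputs are assumed to be the text of the parent's cgroup.controllers and cgroup.subtree_control files.

```python
from typing import Set


def subtree_control_delta(available: str, enabled: str, wanted: Set[str]) -> str:
    """Build the "+ctrl -ctrl" string to write to cgroup.subtree_control.

    available: contents of the parent's cgroup.controllers
    enabled:   contents of the parent's cgroup.subtree_control
    wanted:    controllers the children should end up with (hypothetical input)
    """
    avail = set(available.split())
    current = set(enabled.split())
    missing = wanted - avail
    if missing:
        # A parent can only hand down controllers it has itself.
        raise ValueError(f"controllers not available here: {sorted(missing)}")
    enable = sorted(wanted - current)
    disable = sorted(current - wanted)
    return " ".join([f"+{c}" for c in enable] + [f"-{c}" for c in disable])


if __name__ == "__main__":
    print(subtree_control_delta("cpu io memory pids", "pids", {"cpu", "memory"}))
```

An empty result means the children already see exactly the wanted set, so nothing needs to be written.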
cgroups v2 Architecture Overview
The core of cgroups v2 lives in a single pseudo‑filesystem. The most relevant files are:
| File / Directory | Purpose |
|---|---|
| cgroup.controllers | Lists controllers available to this cgroup (e.g., cpu, memory). |
| cgroup.subtree_control | Enables/disables controllers for child cgroups. |
| cgroup.procs | PIDs belonging to the cgroup. |
| cpu.max | Quota and period in microseconds, e.g., 50000 100000 → 50 % of one CPU. |
| memory.max | Hard memory limit in bytes. |
| io.max | Per-device I/O throttling, expressed as major:minor followed by key=value limits (rbps, wbps, riops, wiops). |
| pressure files (cpu.pressure, memory.pressure, io.pressure) | Quantify contention (stall time) over 10 s, 60 s, and 300 s windows. |
All controllers share the same syntax for “max” values: a number or the string max to indicate “no limit”. This consistency makes it easy to generate policy files programmatically.
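As an illustration of that shared syntax, the sketch below parses a cpu.max value into a fraction of one CPU. The helper name cpu_max_fraction is hypothetical; the fallback period of 100000 µs is the kernel's default for cpu.max.

```python
from typing import Optional


def cpu_max_fraction(value: str) -> Optional[float]:
    """Return the CPU cap as a fraction of one CPU, or None when unlimited."""
    parts = value.split()
    quota = parts[0]
    # The period is optional when writing; the kernel default is 100000 us.
    period = int(parts[1]) if len(parts) > 1 else 100000
    if quota == "max":
        return None
    return int(quota) / period


if __name__ == "__main__":
    for raw in ("50000 100000", "max 100000", "200000 100000"):
        print(raw, "->", cpu_max_fraction(raw))
```

A result above 1.0 simply means the group may use more than one full CPU per period.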
Example: Inspecting the root hierarchy
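# Most systemd hosts mount the hierarchy at boot; mount it manually only if it is missing.
$ mount -t cgroup2 none /sys/fs/cgroup
$ cat /sys/fs/cgroup/cgroup.controllers
cpu io memory pids
# Listing abridged; exact entries vary by kernel and distribution.
$ ls /sys/fs/cgroup/
cgroup.controllers  cgroup.procs  cgroup.subtree_control  cpu.pressure  io.pressure  memory.pressure  system.slice  user.slice
Notice the single mount point: controllers appear as interface files (cpu.*, memory.*, io.*) inside each cgroup rather than as separate per-controller directories as in v1. The directories you do see (system.slice, user.slice) are child cgroups.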
Patterns in Production
1. Systemd Slice per Service
Systemd is the de‑facto init system on most modern Linux distributions and already integrates with cgroups v2. By defining a slice you can give an entire service family its own resource envelope.
# /etc/systemd/system/webapp.slice
[Slice]
# Turn on accounting for the controllers we care about.
# systemd enables the matching controllers in cgroup.subtree_control
# automatically; there is no directive for listing controllers by hand.
CPUAccounting=yes
MemoryAccounting=yes
IOAccounting=yes
Then attach a service to the slice:
# /etc/systemd/system/webapp.service
[Unit]
Description=Web Application
After=network.target
[Service]
Slice=webapp.slice
ExecStart=/usr/local/bin/webapp
# Fine‑grained limits
CPUQuota=40%
MemoryMax=2G
IOReadBandwidthMax=/dev/sda 10M
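Systemd writes the appropriate cpu.max, memory.max, and io.max files under /sys/fs/cgroup/webapp.slice/webapp.service, because limits set on a service land in that service's own cgroup inside the slice. Because the slice is a first-class cgroup, you can query its stats with systemd-cgtop or systemctl status webapp.slice, and launch ad-hoc workloads inside it with systemd-run --slice=webapp.slice --property=....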
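The mapping from CPUQuota= to cpu.max can be sketched as follows. The helper name cpu_quota_to_cpu_max is hypothetical, and the sketch assumes systemd's default quota period of 100 ms (CPUQuotaPeriodSec= left unset).

```python
def cpu_quota_to_cpu_max(cpu_quota: str, period_us: int = 100_000) -> str:
    """Translate a systemd-style percentage such as "40%" into a cpu.max line."""
    percent = float(cpu_quota.rstrip("%"))
    if percent <= 0:
        raise ValueError("CPUQuota must be a positive percentage")
    # 100% equals one full CPU per period; values above 100% span several CPUs.
    quota_us = int(round(period_us * percent / 100))
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    print(cpu_quota_to_cpu_max("40%"))
    print(cpu_quota_to_cpu_max("250%"))
```

Comparing this value with the contents of the unit's cpu.max file is a quick way to confirm that a drop-in or override actually took effect.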
2. Delegating to Containers via CRI
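Kubernetes delegates container creation to a Container Runtime Interface (CRI) implementation such as containerd or CRI-O. The kubelet decides the cgroup layout and limits for each pod; a RuntimeClass selects which runtime handler runs the pod and can declare a fixed per-pod overhead that is added to the pod's cgroup budget.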
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: cgroupv2-highcpu
# Must match a handler name configured in containerd or CRI-O.
handler: runc
scheduling:
  nodeSelector:
    kubernetes.io/arch: amd64
overhead:
  podFixed:
    cpu: "500m"
    memory: "256Mi"
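When a pod specifies runtimeClassName: cgroupv2-highcpu, the kubelet creates a pod cgroup under the node's kubepods.slice. If every container in the pod declares a CPU limit, the pod's cpu.max is derived from the sum of those limits plus the overhead; CPU requests map to cpu.weight rather than cpu.max. This pattern lets you fold runtime overhead into per-pod caps without custom admission controllers.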
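The arithmetic behind a pod-level cap can be sketched like this. It is an approximation of the calculation rather than the kubelet's actual code: the helper names are hypothetical, the period is assumed to be the default 100 ms CFS period, and only plain and millicore CPU quantities ("1", "0.5", "500m") are handled.

```python
from typing import Iterable


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def pod_cpu_max(container_limits: Iterable[str], overhead: str = "0",
                period_us: int = 100_000) -> str:
    """Sum container CPU limits plus RuntimeClass overhead into a cpu.max line."""
    total_millis = sum(parse_cpu_millis(c) for c in container_limits)
    total_millis += parse_cpu_millis(overhead)
    # 1000 millicores correspond to one full period of CPU time.
    quota_us = total_millis * period_us // 1000
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    # Two containers (1 CPU and 500m) plus the 500m overhead from the RuntimeClass.
    print(pod_cpu_max(["1", "500m"], overhead="500m"))
```

With the RuntimeClass above, a pod whose containers are limited to 1.5 CPUs in total ends up capped at 2 CPUs.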
3. Per‑Tenant Sub‑trees in Multi‑Tenant SaaS
A SaaS platform may allocate each tenant its own cgroup subtree:
/sys/fs/cgroup/tenants/
├─ tenant-a/
│  ├─ cpu.max = "20000 100000"   # 20 % of a CPU
│  └─ memory.max = "4G"
└─ tenant-b/
   ├─ cpu.max = "50000 100000"   # 50 % of a CPU
   └─ memory.max = "8G"
The platform's scheduler (a custom Go service) writes these files via the cgroups Go library. Because the hierarchy is static, you can audit limits with a single find command. To hand a tenant subtree off to a container runtime, write "+cpu +memory" to that tenant's cgroup.subtree_control and give the runtime's user ownership of the tenant directory and its cgroup.procs and cgroup.subtree_control files.
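For audits that need structured output rather than raw find results, a small walker along these lines may help. The function name audit_limits is hypothetical, and the sketch assumes every direct child directory of the base path is a tenant cgroup.

```python
import pathlib
from typing import Dict, Optional, Sequence


def audit_limits(base, files: Sequence[str] = ("cpu.max", "memory.max")
                 ) -> Dict[str, Dict[str, Optional[str]]]:
    """Collect the listed interface files for every tenant directory under base."""
    report: Dict[str, Dict[str, Optional[str]]] = {}
    for cg in sorted(p for p in pathlib.Path(base).iterdir() if p.is_dir()):
        report[cg.name] = {
            # None marks a file that is absent, e.g. a controller not enabled.
            name: (cg / name).read_text().strip() if (cg / name).exists() else None
            for name in files
        }
    return report


if __name__ == "__main__":
    import json
    import sys
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/tenants"
    print(json.dumps(audit_limits(root), indent=2))
```

Keep in mind that the kernel reports memory.max in bytes, so a limit written as "4G" reads back as 4294967296.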
Implementation Strategies
Mounting cgroups v2 Early
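A production host should mount cgroups v2 before any userland services start. systemd does this itself on distributions that default to the unified hierarchy; with other init systems, mount it from the initramfs or via /etc/fstab. Example entry: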
cgroup2 /sys/fs/cgroup cgroup2 defaults,noexec,nosuid,nodev 0 0
If you need both v1 and v2 (for legacy tools), enable the hybrid mode:
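# Hybrid layout: the v1 controllers stay mounted under /sys/fs/cgroup/<controller>
# and the v2 hierarchy is mounted alongside them. A controller can be attached to
# only one hierarchy at a time; boot with cgroup_no_v1=<list> on the kernel
# command line to release controllers from v1 so that v2 can use them.
mkdir -p /sys/fs/cgroup/unified
mount -t cgroup2 -o nsdelegate none /sys/fs/cgroup/unified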
Hybrid mode is discouraged for new deployments because it re‑introduces the fragmentation we aim to eliminate.
Automating Policy Generation
Large fleets benefit from a policy-as-code approach. Below is a minimal Python script that reads a JSON spec and writes the appropriate cgroup files.
#!/usr/bin/env python3
"""Apply per-tenant cgroup v2 limits from a JSON spec."""
import json
import pathlib

SPEC_PATH = "/etc/cgroup-policies/tenants.json"
BASE = pathlib.Path("/sys/fs/cgroup/tenants")

def apply_policy(tenant, cfg):
    cg = BASE / tenant
    cg.mkdir(parents=True, exist_ok=True)
    # Apply limits. These files exist only once BASE has enabled the
    # matching controllers (done in the main block below).
    (cg / "cpu.max").write_text(f"{cfg['cpu_quota']} {cfg['cpu_period']}")
    (cg / "memory.max").write_text(str(cfg['memory_max']))
    if 'io' in cfg:
        # io.max expects "MAJ:MIN key=value", e.g. "8:0 rbps=10485760"
        io_line = f"{cfg['io']['dev']} {cfg['io']['type']}={cfg['io']['limit']}"
        (cg / "io.max").write_text(io_line)

if __name__ == "__main__":
    with open(SPEC_PATH) as f:
        data = json.load(f)
    BASE.mkdir(parents=True, exist_ok=True)
    # Controllers are enabled in the PARENT's cgroup.subtree_control; the
    # root cgroup must likewise have enabled them for BASE.
    (BASE / "cgroup.subtree_control").write_text("+cpu +memory +io")
    for tenant, cfg in data.items():
        apply_policy(tenant, cfg)
Deploy this script with a systemd timer that runs every 5 minutes, ensuring that any drift caused by manual tinkering is corrected automatically.
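Before overwriting anything, it can be useful to report what has drifted. The sketch below compares a desired state with the values read back from the hierarchy; the function name plan_changes and the nested-dict shape of both inputs are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Limits = Dict[str, Dict[str, str]]
Change = Tuple[str, str, Optional[str], str]


def plan_changes(desired: Limits, current: Limits) -> List[Change]:
    """List (tenant, file, current_value, desired_value) for every mismatch."""
    changes: List[Change] = []
    for tenant, files in sorted(desired.items()):
        for name, want in sorted(files.items()):
            have = current.get(tenant, {}).get(name)
            if have != want:
                changes.append((tenant, name, have, want))
    return changes


if __name__ == "__main__":
    desired = {"tenant-a": {"cpu.max": "20000 100000", "memory.max": "4294967296"}}
    current = {"tenant-a": {"cpu.max": "max 100000", "memory.max": "4294967296"}}
    for change in plan_changes(desired, current):
        print(change)
```

The comparison is textual, so the desired values should use the form the kernel reports (bytes for memory.max, "quota period" for cpu.max); otherwise every run would flag a false drift.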
Integrating with Observability Stack
cgroups v2 exposes pressure (PSI) metrics. The node_exporter pressure collector scrapes the host-wide values from /proc/pressure; per-cgroup values live in each cgroup's *.pressure files and need a cgroup-aware exporter or a small custom collector. Example Prometheus rule to alert on host-wide memory pressure:
- alert: HighMemoryPressure
  expr: rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.8
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Memory pressure > 80% on {{ $labels.instance }}"
    description: "Tasks on {{ $labels.instance }} have been stalled on memory for more than 80% of the time over the last 5 minutes."
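Couple this with Grafana dashboards that plot each cgroup's cpu.max against its actual CPU usage (usage_usec in cpu.stat, or container_cpu_usage_seconds_total from cAdvisor; node_cpu_seconds_total only covers the whole host). The visual correlation quickly reveals whether a limit is too tight or if a noisy neighbor is stealing cycles.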
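For the per-cgroup side, a custom collector mostly needs to parse the pressure file format. A minimal sketch, assuming the standard PSI layout of "some" and "full" lines with avg10, avg60, avg300, and total fields; the function name parse_pressure is hypothetical.

```python
from typing import Dict


def parse_pressure(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a *.pressure file into nested dictionaries."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # avg* values are percentages; total is cumulative stall time in microseconds.
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


if __name__ == "__main__":
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    print(parse_pressure(sample)["some"]["avg10"])
```

Exporting the total field as a counter and letting Prometheus compute the rate keeps the alerting logic consistent with the host-wide metrics.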
Monitoring & Debugging in Production
Real‑time inspection with systemd-cgtop
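$ systemd-cgtop
Control Group    Tasks   %CPU   Memory  Input/s Output/s
/                    -   17.8     2.8G        -        -
/system.slice        -    2.1     1.2G        -        -
/user.slice          -      -       0B        -        -
/webapp.slice        -   15.7     1.6G        -        -
systemd-cgtop aggregates usage per control group (tasks, CPU, memory, and I/O rates), letting you spot runaway services quickly. The figures above are illustrative. The tool does not display configured limits, so read memory.max directly or run systemctl show for those.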
Verifying the mount and controllers
Standard tools can verify that the hierarchy is correctly mounted and which controllers are active.
$ stat -fc %T /sys/fs/cgroup
cgroup2fs
$ cat /sys/fs/cgroup/cgroup.controllers
cpu io memory pids
If you see missing controllers, check the kernel config (CONFIG_CGROUP_*), make sure the controller is not still attached to a v1 hierarchy, and confirm that the parent's cgroup.subtree_control enables it.
Debugging I/O throttling
When io.max limits cause latency, the iostat command can be combined with cgroup stats:
$ cat /sys/fs/cgroup/webapp.slice/io.stat
8:0 rbytes=104857600 wbytes=52428800 rios=5000 wios=2500
Cross‑reference iostat -x for the underlying device to confirm whether the limit is the bottleneck or the storage subsystem itself.
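To turn two io.stat samples into a throughput figure that can be compared with the configured limit, something like the following could be used. The helper names parse_io_stat and io_rate are hypothetical, and the second sample in the demo is made up for illustration.

```python
from typing import Dict


def parse_io_stat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse io.stat into {"MAJ:MIN": {"rbytes": ..., "wbytes": ..., ...}}."""
    devices: Dict[str, Dict[str, int]] = {}
    for line in text.strip().splitlines():
        dev, *fields = line.split()
        devices[dev] = {k: int(v) for k, v in (f.split("=", 1) for f in fields)}
    return devices


def io_rate(before: Dict[str, Dict[str, int]], after: Dict[str, Dict[str, int]],
            interval_s: float, dev: str, key: str = "rbytes") -> float:
    """Average per-second rate of one counter between two samples."""
    return (after[dev][key] - before[dev][key]) / interval_s


if __name__ == "__main__":
    t0 = parse_io_stat("8:0 rbytes=104857600 wbytes=52428800 rios=5000 wios=2500")
    # Hypothetical sample taken one second later.
    t1 = parse_io_stat("8:0 rbytes=114857600 wbytes=52428800 rios=5400 wios=2500")
    print(io_rate(t0, t1, 1.0, "8:0"))
```

A read rate that sits close to the rbps value in io.max while iostat shows the device far from saturation suggests the throttle, not the storage, is the bottleneck.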
Performance Tuning Tips
- Prefer cpu.max over cpu.weight when you need a hard cap: cpu.max provides an absolute quota, whereas cpu.weight (the v2 successor to v1's cpu.shares) only divides CPU proportionally under contention, which is the "share-based" ambiguity behind the "bursty" behaviour in v1.
- Set memory.high in addition to memory.max: memory.high throttles the group and triggers reclamation before the hard limit, smoothing out sudden spikes.
- Leverage pids.max for fork-bomb protection: in environments that run untrusted code (CI runners), limiting the number of processes per cgroup prevents resource exhaustion.
- Combine cpu.idle with cpu.max for background work: writing 1 to cpu.idle gives the group's tasks idle scheduling priority, so they run mainly when other groups leave cycles free, without changing quotas.
- Use cgroup.freeze with care when draining: freezing stops every process in the cgroup, so in-flight requests cannot complete while it is frozen. When rolling out a new version, drain traffic first, then freeze the old cgroup to stop stragglers, and finally destroy it.
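One way to judge whether a cpu.max value is too tight is the share of scheduling periods in which the group was throttled, taken from cpu.stat. A minimal sketch; the function name throttle_ratio is hypothetical, and the sample values are invented for the demo.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of enforcement periods in which the cgroup hit its cpu.max quota."""
    # cpu.stat is a flat "key value" file, one pair per line.
    stats = dict(line.split() for line in cpu_stat_text.strip().splitlines())
    periods = int(stats.get("nr_periods", 0))
    if periods == 0:
        # No quota configured, or no period has elapsed yet.
        return 0.0
    return int(stats.get("nr_throttled", 0)) / periods


if __name__ == "__main__":
    sample = (
        "usage_usec 1000\n"
        "user_usec 600\n"
        "system_usec 400\n"
        "nr_periods 200\n"
        "nr_throttled 50\n"
        "throttled_usec 12345\n"
    )
    print(throttle_ratio(sample))
```

The counters are cumulative, so in practice the ratio is more informative when computed from the difference between two samples rather than from a single read.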
Key Takeaways
- cgroups v2 unifies all controllers under a single hierarchy, simplifying configuration and delegation.
- Systemd slices and Kubernetes pod cgroups (with RuntimeClass overhead where needed) are the most production-ready ways to apply limits consistently.
- Delegating controllers via cgroup.subtree_control reduces the attack surface and prevents accidental over-provisioning.
- Real-time tools (systemd-cgtop, node_exporter) and pressure metrics give you early warning of contention.
- Policy-as-code (e.g., Python script) automates limit enforcement and makes audits trivial.
- Fine-tuning (cpu.max, memory.high, pids.max) yields predictable performance without sacrificing flexibility.