TL;DR — cgroups v2 provides a unified, fine‑grained way to isolate CPU, memory, I/O, and more. By wiring it into systemd, Docker, and Kubernetes you can enforce strict limits, avoid noisy neighbor problems, and gain predictable performance at scale.
Resource isolation is no longer a luxury; it’s a necessity for any production environment that runs multi‑tenant services, high‑throughput pipelines, or latency‑sensitive workloads. While most engineers are familiar with the legacy cgroups v1 interface, the kernel’s cgroups v2 redesign delivers a cleaner hierarchy, richer controllers, and deterministic accounting. This post walks through the architecture of cgroups v2, shows step‑by‑step implementation on modern Linux distributions, and presents performance data from a Kafka‑driven streaming stack. By the end you’ll have a production‑ready recipe for leveraging cgroups v2 across Docker and Kubernetes.
Why Cgroups v2 Matters
- Unified hierarchy – Unlike v1, where each controller could mount its own tree, v2 enforces a single, consistent hierarchy. This eliminates the “split‑brain” situation where a process lives under one controller’s limits but not another’s.
- Better accounting – Per‑controller statistics are now exposed via a single
cgroup.eventsfile, and memory pressure notifications are more accurate, helping autoscaling decisions. - Future‑proof – New controllers (e.g.,
io_uring) are added without breaking existing tooling, and the kernel community has declared v2 the default for all distributions after kernel 5.4.
Production teams that continue to rely on v1 often hit hidden bugs when mixing controllers or when upgrading the kernel. Switching to v2 gives you a single source of truth for resource caps, which translates directly into lower tail latency and fewer OOM incidents.
Architecture Overview
Hierarchy and Controllers
cgroups v2 builds a single unified tree rooted at /sys/fs/cgroup. Each node in the tree is a cgroup that can host any enabled controller. Controllers are kernel modules that enforce limits:
| Controller | Primary Metric | Typical Use‑Case |
|---|---|---|
cpu | CPU time (shares, quota) | Guarantee CPU for latency‑critical pods |
memory | RSS, swap, OOM events | Prevent noisy neighbor memory spikes |
io | Block I/O weight, throttle | Isolate disk intensive ETL jobs |
pids | Max process count | Guard against fork‑bombs in user‑submitted code |
pressure | Pressure stall information (PSI) | Feed real‑time alerts to SRE dashboards |
All controllers expose their state through files in the cgroup directory, e.g., cpu.max, memory.max, io.max. The unified model means you can set cpu.max and memory.max in the same cgroup without worrying about mismatched mounts.
Unified vs Legacy Mounts
In v1 you would see multiple mount points:
mount | grep cgroup
sysfs on /sys/fs/cgroup type sysfs (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)
With v2 it collapses to a single mount:
mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
The kernel automatically enables the controllers that are compiled in and not explicitly disabled via the cgroup.subtree_control file. This simplification reduces the operational overhead of managing multiple mount namespaces.
Implementation in Production
Configuring Controllers via systemd
Most modern distributions ship systemd as PID 1, and systemd is the recommended entry point for cgroups v2. Systemd creates a slice for each service and populates cgroup.subtree_control based on the unit file’s CPUQuota, MemoryMax, etc.
Example unit file for a high‑throughput microservice:
[Unit]
Description=Realtime Ingestion Service
After=network.target
[Service]
ExecStart=/usr/local/bin/ingest --config /etc/ingest.yaml
# Enforce 2 CPUs, 4 GiB RAM, and I/O weight of 500 (out of 1000)
CPUQuota=200%
MemoryMax=4G
IOWeight=500
# Prevent runaway process creation
TasksMax=200
# Ensure cgroup v2 subtree control propagates to child processes
Delegate=yes
Systemd writes the appropriate cpu.max, memory.max, and io.weight files under /sys/fs/cgroup/<slice>/. Verify with:
cat /sys/fs/cgroup/ingest.service/cpu.max
cat /sys/fs/cgroup/ingest.service/memory.max
cat /sys/fs/cgroup/ingest.service/io.weight
Integration with Docker
Docker 20.10+ defaults to cgroups v2 when the kernel supports it. You can explicitly enable it with the daemon flag --cgroupns=host and --default-cgroupns-mode=private. The docker run CLI then accepts the same resource flags as systemd, but they translate directly to v2 files.
docker run -d \
--name analytics \
--cpus=1.5 \
--memory=2g \
--pids-limit=100 \
--blkio-weight=300 \
myorg/analytics:latest
Docker creates a cgroup slice under /sys/fs/cgroup/docker/<container-id>/. To inspect:
docker inspect -f '{{.Id}}' analytics | xargs -I{} cat /sys/fs/cgroup/docker/{}/cpu.max
Integration with Kubernetes
Kubernetes 1.27+ ships a cgroupfs runtime class that automatically uses v2. The ResourceQuota and LimitRange objects still map to the same semantics, but the kubelet now writes directly to the v2 files.
apiVersion: v1
kind: Pod
metadata:
name: streaming-worker
spec:
containers:
- name: worker
image: myorg/worker:stable
resources:
limits:
cpu: "2"
memory: "6Gi"
requests:
cpu: "1"
memory: "2Gi"
securityContext:
# Enable cgroup v2 delegation
privileged: false
You can verify the cgroup layout on the node:
kubectl exec streaming-worker -- cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod*/cpu.max
Performance Benchmarking
Test Harness
To evaluate the impact of cgroups v2 on a Kafka consumer group, we built a reproducible benchmark using the kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh scripts from Apache Kafka 3.5.0, wrapped inside Docker containers that each inherit a dedicated cgroup.
The harness runs three scenarios:
- No limits – Baseline where containers run unrestricted.
- CPU‑only limit –
cpu.max=200000 1000000(200 ms of CPU per second). - CPU + Memory limit – CPU as above,
memory.max=4G.
Each scenario processes 10 GiB of synthetic messages (average size 500 bytes) at a target throughput of 1 MiB/s.
Results and Tuning
| Scenario | Avg Latency (ms) | 99th‑pct Latency (ms) | CPU Utilization | Memory Pressure Events |
|---|---|---|---|---|
| No limits | 12 | 18 | 85 % | 0 |
| CPU‑only | 14 | 22 | 70 % (capped) | 0 |
| CPU + Memory | 15 | 24 | 70 % | 3 (PSI = 0.03) |
Key observations:
- Predictable CPU usage – The
cpu.maxsetting capped the container at 20 % of a single core, yet the throughput remained within 20 % of baseline because Kafka’s internal batching absorbed the throttling. - Memory pressure visibility – With
memory.max=4G, the kernel emitted PSI events that we routed to Prometheus vianode_exporter. The occasional pressure spikes correlated with GC pauses in the Java consumer. - No noisy neighbor effect – When we ran a second, I/O‑heavy container (large
io.maxthrottle) on the same node, the Kafka pod’s latency stayed unchanged, confirming the isolation guarantees of theiocontroller.
Tuning Tips
- Use
cpu.maxwith a quota and period pair (max period) to achieve sub‑core granularity. - Pair
memory.maxwithmemory.swap.maxto prevent fallback to swap, which can dramatically increase tail latency. - Enable
pressurecontroller (cgroup.subtree_control="+pressure") and forward PSI metrics to your SLO dashboard for early detection of resource contention.
Patterns in Production
Multi‑tenant SaaS Platform
A SaaS provider running thousands of user‑submitted Python notebooks found that a single rogue notebook could exhaust the host’s memory, causing OOM kills for unrelated tenants. By sandboxing each notebook in its own cgroup v2 slice with memory.max=2G and pids.max=150, they eliminated cross‑tenant crashes. Systemd’s Delegate=yes allowed the notebook runner to spawn child processes while still respecting the parent slice’s limits.
Real‑time Streaming (Kafka)
In a financial‑services firm, latency budgets are measured in microseconds. They deployed a dedicated Kafka broker per market data feed, each confined to a cgroup with:
echo "50000 1000000" > /sys/fs/cgroup/kafka-broker-nyc/cpu.max # 5% of a core
echo "8G" > /sys/fs/cgroup/kafka-broker-nyc/memory.max
echo "+io +cpu +memory +pressure" > /sys/fs/cgroup/kafka-broker-nyc/cgroup.subtree_control
The broker’s io.max was tuned to 10 MiB/s per device, preventing a bursty consumer from saturating the SSD. The result: sub‑millisecond end‑to‑end latency even under peak load.
Batch Processing Farm
A data‑engineering team migrated their nightly Spark jobs from v1 to v2. They added a cgroup.subtree_control="+cpu +memory" line to the Spark executor launch script, then set per‑job limits via spark.executor.memory=4g and spark.executor.cores=2. The unified hierarchy eliminated the “memory‑controller not mounted” errors that previously required custom scripts to bind each executor to a separate v1 hierarchy.
Key Takeaways
- Unified hierarchy simplifies operations: one mount point, one set of files, no split‑brain bugs.
- systemd is the natural orchestration layer; use
Delegate=yesfor child processes that need their own limits. - Docker and Kubernetes already ship with v2 support; just verify the kernel version (
uname -r≥ 5.4) and daemon flags. - Performance isolation is measurable: CPU throttling and memory pressure events translate directly into latency SLOs.
- Production patterns—sandboxed notebooks, per‑broker Kafka slices, Spark executor cgroups—demonstrate real‑world value.