Mastering cgroups v2 Resource Isolation: Implementation Strategies for Production Workload Performance and Control

TL;DR — cgroups v2 gives you fine‑grained, hierarchical resource controls; by structuring controllers, using a unified hierarchy, and wiring them into systemd/Kubernetes you can guarantee performance SLAs and avoid noisy‑neighbor issues in production.

Resource isolation is no longer a nice‑to‑have feature; it is a production prerequisite. Whether you run a fleet of micro‑services on Kubernetes, a batch processing cluster on bare metal, or a container‑heavy CI pipeline, uncontrolled CPU, memory, and I/O can cascade into latency spikes and costly outages. The Linux kernel’s second‑generation control groups (cgroups v2) provide a single, unified hierarchy that simplifies the mental model while delivering more precise throttling, accounting, and delegation. This post walks through the architecture, shows how to embed cgroups v2 into systemd and Kubernetes, and equips you with concrete patterns and monitoring tricks that have proved reliable at scale.

Why cgroups v2 Matters for Production

Unified hierarchy eliminates fragmentation

cgroups v1 let every controller (e.g., cpu, memory, blkio) live in its own hierarchy, each with a separate mount. In a large cluster this produced a tangled forest of mount points, making it easy to misconfigure delegation or forget a controller entirely. cgroups v2 places all controllers in a single tree under one mount point (/sys/fs/cgroup). A process belongs to exactly one cgroup, and each parent decides which controllers its children receive through cgroup.subtree_control. This uniformity reduces configuration drift, a common source of production incidents.

Stronger guarantees through “thread‑aware” accounting

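cgroups v2 supports a threaded mode: writing threaded to a child's cgroup.type lets the threads of one process be spread across sibling cgroups, each with its own cpu.weight or cpu.max. Only thread‑aware controllers (cpu, cpuset, perf_event, pids) can be used this way; memory is still accounted per process. Combined with per‑cgroup pressure files, this is especially valuable for Java or Go workloads that spawn many worker threads, because a pool of background threads can be capped without starving request handlers.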

Built‑in delegation for containers

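In cgroups v2 a parent hands controllers down to its children by writing to cgroup.subtree_control, and it can delegate a whole subtree to a less privileged user by giving that user ownership of the directory and of its cgroup.procs, cgroup.threads, and cgroup.subtree_control files. Containers launched by Docker, Podman, or CRI‑O can therefore be given only the controllers they need without exposing the rest of the hierarchy, which tightens the attack surface and simplifies security audits.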
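
As a rough illustration of the hand‑down step, the sketch below builds the string a parent would write into cgroup.subtree_control. The function name subtree_control_write is hypothetical, and the controller list passed in is assumed to be the content of the parent's cgroup.controllers file.

```python
def subtree_control_write(available: str, wanted: list) -> str:
    """Build the string to write into a parent's cgroup.subtree_control.

    `available` is the text of the parent's cgroup.controllers file;
    `wanted` lists the controllers to hand down to child cgroups.
    """
    have = set(available.split())
    missing = [c for c in wanted if c not in have]
    if missing:
        # A parent can only hand down controllers it has itself.
        raise ValueError(f"controllers not available here: {missing}")
    return " ".join(f"+{c}" for c in wanted)


if __name__ == "__main__":
    # Sample controller list; a real one comes from cgroup.controllers.
    print(subtree_control_write("cpuset cpu io memory pids", ["cpu", "memory"]))
```

Validating against the parent's list first avoids a failed write when a requested controller is not available at that level.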

cgroups v2 Architecture Overview

The core of cgroups v2 lives in a single pseudo‑filesystem. The most relevant files are:

File / Directory	Purpose
`cgroup.controllers`	Lists controllers available on the system (e.g., `cpu`, `memory`).
`cgroup.subtree_control`	Enables/disables controllers for child cgroups.
`cgroup.procs`	PIDs belonging to the cgroup.
`cpu.max`	`quota period` pair in microseconds, e.g., `50000 100000` → 50 % of one CPU.
`memory.max`	Hard memory limit in bytes.
`io.max`	Per‑device I/O throttling, expressed as `major:minor` followed by `rbps=`, `wbps=`, `riops=`, or `wiops=` keys.
`pressure` files (`cpu.pressure`, `memory.pressure`, `io.pressure`)	Quantify contention (pressure stall information) over 10 s, 60 s, and 300 s windows.

Limit files share the same convention for “max” values: a number, or the string max to indicate “no limit”. This consistency makes it easy to generate policy files programmatically.
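
A minimal sketch of how a tool might read these values, assuming the file content has already been loaded into a string. The helper names parse_limit and cpu_max_to_cpus are illustrative and not part of any library.

```python
from typing import Optional


def parse_limit(text: str) -> Optional[int]:
    """Return a limit as an int, or None when the file says 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def cpu_max_to_cpus(text: str) -> Optional[float]:
    """Convert cpu.max content ('<quota> <period>') into a number of CPUs."""
    quota, period = text.split()
    quota_us = parse_limit(quota)
    return None if quota_us is None else quota_us / int(period)


if __name__ == "__main__":
    print(cpu_max_to_cpus("50000 100000"))   # half of one CPU
    print(cpu_max_to_cpus("max 100000"))     # unlimited
    print(parse_limit("2147483648\n"))       # memory.max in bytes
```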

Example: Inspecting the root hierarchy

$ mount -t cgroup2 none /sys/fs/cgroup    # only needed if it is not already mounted
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids
$ ls /sys/fs/cgroup/
# abridged; exact entries vary by kernel and distribution
cgroup.controllers  cgroup.procs  cgroup.subtree_control  cpu.pressure  io.pressure  memory.pressure  system.slice  user.slice

Notice the single mount point. Controllers show up as entries in cgroup.controllers and as interface files, not as per‑controller directories, and the only directories are child cgroups.

Patterns in Production

1. Systemd Slice per Service

Systemd is the de‑facto init system on most modern Linux distributions and already integrates with cgroups v2. By defining a slice you can give an entire service family its own resource envelope.

# /etc/systemd/system/webapp.slice
[Slice]
# systemd has no "Controllers=" directive. It enables a controller in
# cgroup.subtree_control automatically once a unit in the subtree uses a
# matching resource-control or accounting setting.
CPUWeight=100
MemoryAccounting=yes
IOAccounting=yes

Then attach a service to the slice:

# /etc/systemd/system/webapp.service
[Unit]
Description=Web Application
After=network.target

[Service]
Slice=webapp.slice
ExecStart=/usr/local/bin/webapp
# Fine‑grained limits
CPUQuota=40%
MemoryMax=2G
IOReadBandwidthMax=/dev/sda 10M

Systemd writes the appropriate cpu.max, memory.max, and io.max files under /sys/fs/cgroup/webapp.slice/webapp.service. Because the slice is a first‑class cgroup, you can watch its usage with systemd-cgtop, list its members with systemd-cgls, and read the applied settings with systemctl show webapp.service.
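
To make the mapping concrete, the sketch below converts the two simplest settings from the unit file into the values one would expect to find in the cgroup files. It assumes systemd's default 100 ms CPU quota period and base‑1024 suffixes for memory sizes; the function names are hypothetical.

```python
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def cpu_quota_to_cpu_max(percent: float, period_us: int = 100_000) -> str:
    """Translate a CPUQuota= percentage into 'quota period' for cpu.max."""
    quota_us = int(period_us * percent / 100)
    return f"{quota_us} {period_us}"


def memory_size_to_bytes(value: str) -> int:
    """Translate a MemoryMax= style size such as '2G' into bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in SUFFIXES:
        return int(float(value[:-1]) * SUFFIXES[suffix])
    return int(value)


if __name__ == "__main__":
    print(cpu_quota_to_cpu_max(40))        # CPUQuota=40%
    print(memory_size_to_bytes("2G"))      # MemoryMax=2G
```

A quota above 100 % simply yields a quota larger than the period, which is how more than one full CPU is expressed.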

2. Delegating to Containers via CRI

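Kubernetes delegates container creation to a Container Runtime Interface (CRI) implementation such as containerd or CRI‑O. On a cgroups v2 node the kubelet places every pod in its own cgroup under kubepods.slice. A RuntimeClass does not choose the cgroup hierarchy; it selects the runtime handler and can declare a fixed per‑pod overhead that is added when the pod's cgroup is sized.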

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: cgroupv2-highcpu
handler: runc  # must match a handler name configured in the CRI runtime
scheduling:
  nodeSelector:
    kubernetes.io/arch: amd64
overhead:
  podFixed:
    cpu: "500m"
    memory: "256Mi"

When a pod specifies runtimeClassName: cgroupv2-highcpu, the kubelet creates a child cgroup for it under the node’s kubepods.slice. CPU requests are translated into cpu.weight and CPU limits into cpu.max, with the RuntimeClass overhead added at the pod level. This pattern lets you enforce per‑pod caps without custom admission controllers.
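
The arithmetic behind a pod‑level cpu.max can be sketched as follows. This is an approximation of the commonly described millicore‑to‑quota conversion, not the kubelet's actual code; the 100 ms period, the 1000 µs minimum quota, and the function names are assumptions made for illustration.

```python
def to_milli(cpu: str) -> int:
    """Parse a Kubernetes CPU quantity such as '500m' or '1.5' into millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)


def milli_cpu_to_cpu_max(milli_cpu: int, period_us: int = 100_000) -> str:
    """Convert a CPU limit in millicores into 'quota period' for cpu.max."""
    if milli_cpu <= 0:
        return f"max {period_us}"          # no limit set
    quota_us = max(milli_cpu * period_us // 1000, 1000)  # assumed floor
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    container_limit = to_milli("1500m")    # hypothetical container limit
    overhead = to_milli("500m")            # podFixed.cpu from the RuntimeClass
    print(milli_cpu_to_cpu_max(container_limit + overhead))
```

With a 1500m limit and the 500m overhead from the RuntimeClass above, the pod cgroup would be capped at two full CPUs per period.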

3. Per‑Tenant Sub‑trees in Multi‑Tenant SaaS

A SaaS platform may allocate each tenant its own cgroup subtree:

/sys/fs/cgroup/tenants/
├─ tenant-a/
│  ├─ cpu.max = "20000 100000"   # 20 % of a CPU
│  └─ memory.max = "4G"
└─ tenant-b/
   ├─ cpu.max = "50000 100000"   # 50 %
   └─ memory.max = "8G"

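The platform’s scheduler (a custom Go service) writes these files via a cgroups Go library. Because the hierarchy is static, you can audit limits with a single find command, and you can hand a tenant's subtree to a container runtime by writing "+cpu +memory" to that tenant's cgroup.subtree_control. Once a cgroup enables controllers for its children, its own processes must live in leaf cgroups (the “no internal processes” rule), so plan for a leaf per workload.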
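
The same audit can be done from Python when the results need to feed a report. The sketch below walks one level of tenant directories and collects their limit files; audit_limits is a hypothetical helper, and the demo runs against a temporary directory rather than a real cgroup tree.

```python
import tempfile
from pathlib import Path


def audit_limits(base: Path, files=("cpu.max", "memory.max")) -> dict:
    """Return {tenant: {file: content-or-None}} for each directory under base."""
    report = {}
    for cg in sorted(p for p in base.iterdir() if p.is_dir()):
        report[cg.name] = {
            name: (cg / name).read_text().strip() if (cg / name).exists() else None
            for name in files
        }
    return report


if __name__ == "__main__":
    # Stand-in tree for demonstration; point base at the real tenants
    # directory on a cgroups v2 host instead.
    with tempfile.TemporaryDirectory() as tmp:
        tenant = Path(tmp) / "tenant-a"
        tenant.mkdir()
        (tenant / "cpu.max").write_text("20000 100000\n")
        (tenant / "memory.max").write_text("4294967296\n")
        print(audit_limits(Path(tmp)))
```

Note that the kernel reports memory.max in bytes when read back, even if a suffixed value such as 4G was written.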

Implementation Strategies

Mounting cgroups v2 Early

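A production host should mount cgroups v2 before any userland services start. On systemd hosts this happens automatically: systemd mounts the unified hierarchy at boot when it is the distribution default or when systemd.unified_cgroup_hierarchy=1 is set on the kernel command line. On hosts without systemd, mount it from the initramfs or early in /etc/fstab. Example entry: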

cgroup2  /sys/fs/cgroup  cgroup2  defaults,noexec,nosuid,nodev  0  0

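If you need both v1 and v2 (for legacy tools), systemd versions that still support cgroup v1 can boot in hybrid mode, which keeps the v1 controller mounts and adds a v2 tree at /sys/fs/cgroup/unified:

# Kernel command line for systemd's hybrid layout
systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=0
# Manual equivalent: mount the v2 tree alongside the v1 mounts
mount -t cgroup2 -o nsdelegate none /sys/fs/cgroup/unified

Hybrid mode is discouraged for new deployments because it re‑introduces the fragmentation we aim to eliminate. A controller can be attached to only one hierarchy at a time, so anything still bound to v1 is unavailable in the v2 tree.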
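
To check which layout a host actually booted with, a script can classify the cgroup entries in /proc/mounts. The sketch below is one way to do that, under the assumption that a v2 mount directly at /sys/fs/cgroup means unified mode; cgroup_layout is a hypothetical name.

```python
def cgroup_layout(mounts_text: str) -> str:
    """Classify /proc/mounts-style text as unified, hybrid, legacy, or none."""
    fstypes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] in ("cgroup", "cgroup2"):
            fstypes[fields[1]] = fields[2]
    if fstypes.get("/sys/fs/cgroup") == "cgroup2":
        return "unified"
    if "cgroup2" in fstypes.values():
        return "hybrid"
    return "legacy" if fstypes else "none"


if __name__ == "__main__":
    # On a Linux host this file always exists.
    with open("/proc/mounts") as f:
        print(cgroup_layout(f.read()))
```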

Automating Policy Generation

Large fleets benefit from a policy-as-code approach. Below is a minimal Python script that reads a JSON spec and writes the appropriate cgroup files.

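#!/usr/bin/env python3
"""Apply per-tenant cgroup v2 limits from a JSON spec (must run as root)."""
import json
import pathlib

SPEC_PATH = "/etc/cgroup-policies/tenants.json"
BASE = pathlib.Path("/sys/fs/cgroup/tenants")
CONTROLLERS = "+cpu +memory +io"

def apply_policy(tenant, cfg):
    cg = BASE / tenant
    cg.mkdir(exist_ok=True)
    # Apply limits; these files exist only because BASE enabled the controllers
    (cg / "cpu.max").write_text(f"{cfg['cpu_quota']} {cfg['cpu_period']}")
    (cg / "memory.max").write_text(str(cfg["memory_max"]))
    if "io" in cfg:
        # io.max expects "MAJ:MIN key=value", e.g. "8:0 rbps=10485760";
        # cfg["io"]["type"] must be one of rbps, wbps, riops, wiops
        io = cfg["io"]
        (cg / "io.max").write_text(f"{io['dev']} {io['type']}={io['limit']}")

if __name__ == "__main__":
    with open(SPEC_PATH) as f:
        data = json.load(f)
    BASE.mkdir(parents=True, exist_ok=True)
    # Controllers must be enabled in the parent's cgroup.subtree_control
    # before the tenants' cpu.max / memory.max / io.max files appear.
    # (The root cgroup must likewise have enabled them for BASE.)
    (BASE / "cgroup.subtree_control").write_text(CONTROLLERS)
    for tenant, cfg in data.items():
        apply_policy(tenant, cfg)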

Deploy this script with a systemd timer that runs every 5 minutes, ensuring that any drift caused by manual tinkering is corrected automatically.
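
Before overwriting anything, it can be useful to report what has drifted. The sketch below compares the desired file contents with what is currently on disk; find_drift is a hypothetical helper, and the demo uses a temporary directory in place of a real cgroup. It assumes the desired values are written in the form the kernel reports them (for example bytes rather than 4G).

```python
import tempfile
from pathlib import Path


def find_drift(cg: Path, desired: dict) -> dict:
    """Return {file: (actual, wanted)} for files whose content differs."""
    drift = {}
    for name, wanted in desired.items():
        path = cg / name
        actual = path.read_text().strip() if path.exists() else None
        if actual != wanted:
            drift[name] = (actual, wanted)
    return drift


if __name__ == "__main__":
    # Stand-in directory for demonstration purposes only.
    with tempfile.TemporaryDirectory() as tmp:
        cg = Path(tmp)
        (cg / "cpu.max").write_text("max 100000\n")       # someone lifted the cap
        (cg / "memory.max").write_text("4294967296\n")
        wanted = {"cpu.max": "20000 100000", "memory.max": "4294967296"}
        print(find_drift(cg, wanted))
```

Logging the drift before correcting it gives you an audit trail of who or what keeps changing the limits.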

Integrating with Observability Stack

cgroups v2 exposes pressure stall information (PSI) that Prometheus can scrape. node_exporter's pressure collector exports the host‑wide counters from /proc/pressure; the per‑cgroup pressure files need a cgroup‑aware exporter. Example Prometheus rule to alert on host memory pressure:

- alert: HighMemoryPressure
  expr: rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.8
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Memory pressure > 80% on {{ $labels.instance }}"
    description: "Tasks on {{ $labels.instance }} spent more than 80% of the last 5 minutes waiting on memory."

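Couple this with Grafana dashboards that plot each cgroup's cpu.max against its actual CPU usage (usage_usec in cpu.stat, or the equivalent metric from a per‑cgroup exporter) and its throttling counters. The visual correlation quickly reveals whether a limit is too tight or whether a noisy neighbor is stealing cycles.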
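
For cgroups that no exporter covers, the pressure files are simple enough to parse directly. The sketch below turns the content of a cpu.pressure, memory.pressure, or io.pressure file into a nested dictionary; parse_pressure is a hypothetical helper and the sample text is made up for the demo.

```python
def parse_pressure(text: str) -> dict:
    """Parse PSI text into {'some': {...}, 'full': {...}} with float values."""
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return result


if __name__ == "__main__":
    # Made-up sample in the format the kernel uses for pressure files.
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
        "full avg10=0.50 avg60=0.25 avg300=0.00 total=65432\n"
    )
    psi = parse_pressure(sample)
    print(psi["some"]["avg10"], psi["full"]["total"])
```

The avg fields are percentages of wall‑clock time over the window, while total is a cumulative stall time in microseconds, which makes it the better input for rate calculations.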

Monitoring & Debugging in Production

Real‑time inspection with `systemd-cgtop`

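$ systemd-cgtop
Control Group     Tasks   %CPU   Memory  Input/s Output/s
/                     -   17.8     2.8G        -        -
/system.slice         -    2.1     1.2G        -        -
/user.slice           -    0.0       0B        -        -
/webapp.slice         -   15.7     1.6G        -        -

systemd-cgtop aggregates usage per cgroup, including all descendants, letting you spot runaway services quickly. It does not display limits; read those with systemctl show or directly from the cgroup files.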

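Verifying the mount with standard tools

Standard utilities are enough to verify that the unified hierarchy is mounted and that the controllers you need are available. The cgroupfs-mount script shipped by some distributions is aimed at mounting v1 hierarchies on hosts without systemd and is not needed on a v2 host.

$ stat -fc %T /sys/fs/cgroup
cgroup2fs
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids

If you see missing controllers, check the kernel config (CONFIG_CGROUP_*, CONFIG_MEMCG, CONFIG_BLK_CGROUP) and whether the controller is still bound to a v1 hierarchy.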

Debugging I/O throttling

When io.max limits cause latency, the iostat command can be combined with cgroup stats:

$ cat /sys/fs/cgroup/webapp.slice/io.stat
8:0 rbytes=104857600 wbytes=52428800 rios=5000 wios=2500

Cross‑reference iostat -x for the underlying device to confirm whether the limit is the bottleneck or the storage subsystem itself.
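
Because io.stat holds cumulative counters, throughput has to be derived from two samples. The sketch below parses the format shown above and computes a read rate; the helper names and the second sample are invented for the demo.

```python
def parse_io_stat(text: str) -> dict:
    """Parse io.stat text into {'MAJ:MIN': {'rbytes': int, ...}}."""
    devices = {}
    for line in text.strip().splitlines():
        dev, *pairs = line.split()
        devices[dev] = {k: int(v) for k, v in (p.split("=") for p in pairs)}
    return devices


def read_rate(before: dict, after: dict, dev: str, seconds: float) -> float:
    """Bytes read per second on `dev` between two parsed samples."""
    return (after[dev]["rbytes"] - before[dev]["rbytes"]) / seconds


if __name__ == "__main__":
    first = parse_io_stat("8:0 rbytes=104857600 wbytes=52428800 rios=5000 wios=2500")
    # Hypothetical second sample taken 5 seconds later.
    second = parse_io_stat("8:0 rbytes=157286400 wbytes=52428800 rios=7500 wios=2500")
    print(read_rate(first, second, "8:0", 5.0))
```

If the computed rate sits flat at the configured io.max value while iostat shows the device far from saturated, the cgroup limit is the likely bottleneck.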

Performance Tuning Tips

Use cpu.max for ceilings and cpu.weight for sharing – cpu.max sets an absolute quota per period, while cpu.weight (the v2 replacement for v1's cpu.shares) only divides CPU proportionally when groups compete. A latency‑sensitive service usually needs both.
Set memory.high in addition to memory.max – memory.high throttles the group and triggers reclaim before the hard limit is reached, smoothing out sudden spikes instead of going straight to the OOM killer.
Leverage pids.max for fork‑bomb protection – In environments that run untrusted code (CI runners), limiting the number of processes per cgroup prevents resource exhaustion.
Reserve cpu.idle for background work – Writing 1 to cpu.idle (kernel 5.15 and later) gives the group's tasks SCHED_IDLE‑like priority, so they run mostly on cycles other groups leave unused. It complements cpu.max rather than replacing it.
Use cgroup.freeze to pause, not to drain – Writing 1 to cgroup.freeze stops every process in the group, so in‑flight requests cannot complete while it is frozen. Drain traffic at the load balancer first, then freeze or stop the old cgroup and remove it.
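
One way to judge whether a cpu.max ceiling is too tight is to look at how often the group was throttled. The sketch below computes that ratio from the content of a cgroup's cpu.stat file; throttle_ratio is a hypothetical helper and the sample values are invented.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stats = dict(line.split() for line in cpu_stat_text.strip().splitlines())
    periods = int(stats.get("nr_periods", 0))
    if periods == 0:
        return 0.0  # no quota configured, or no periods elapsed yet
    return int(stats.get("nr_throttled", 0)) / periods


if __name__ == "__main__":
    # Invented sample in the "key value" format used by cpu.stat.
    sample = (
        "usage_usec 9000000\n"
        "user_usec 7000000\n"
        "system_usec 2000000\n"
        "nr_periods 200\n"
        "nr_throttled 50\n"
        "throttled_usec 1500000\n"
    )
    print(throttle_ratio(sample))
```

A ratio that stays high under normal load suggests raising the quota, or relying more on cpu.weight, rather than accepting the added latency.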

Key Takeaways

cgroups v2 unifies all controllers under a single hierarchy, simplifying configuration and delegation.
Systemd slices and Kubernetes runtime classes are the most production‑ready ways to apply limits consistently.
Delegating controllers via cgroup.subtree_control reduces the attack surface and prevents accidental over‑provisioning.
Real‑time tools (systemd-cgtop, node_exporter) and pressure metrics give you early warning of contention.
Policy‑as‑code (e.g., Python script) automates limit enforcement and makes audits trivial.
Fine‑tuning (cpu.max, memory.high, pids.max) yields predictable performance without sacrificing flexibility.
