Mastering cgroups v2 Resource Isolation: Implementation Strategies for Production Workload Management

TL;DR — cgroups v2 gives you a single unified hierarchy, fine‑grained controllers, and tighter integration with systemd. By enabling it at boot, defining per‑service resource profiles, and wiring metrics into Prometheus, you can reliably isolate CPU, memory, and I/O for any production workload—from a sidecar container to a stateful database.

Resource isolation is no longer a nicety; it’s a prerequisite for predictable latency, cost control, and safe multi‑tenant deployments. While many teams still cling to cgroups v1, the kernel’s v2 implementation eliminates the “multiple‑hierarchy” nightmare and provides a clean API that modern orchestrators like Kubernetes already understand. This post walks through the architecture of cgroups v2, shows how to turn it on in a production Linux host, and presents reusable patterns you can copy into your own CI/CD pipeline.

Why cgroups v2 Matters in Production

Evolution from v1 to v2

cgroups v1 exposed each controller (cpu, memory, blkio, etc.) as a separate virtual filesystem. On a busy host you could end up with dozens of mount points, each with its own hierarchy, making it easy to mis‑configure or double‑count resources. The kernel team introduced cgroups v2 to address those pain points:

Unified hierarchy – a single tree where every controller lives under the same node, eliminating cross‑hierarchy conflicts.
Consistent controller model – every controller sees the same process membership and shares one lifecycle, which simplifies cleanup.
Improved accounting – pressure stall information (PSI) gives a quantitative view of CPU, memory, and I/O contention.
Better integration with systemd – systemd now creates the cgroup tree automatically and exposes resource limits via systemd.resource-control (see the systemd docs).

Because production environments demand reproducibility, the deterministic behavior of v2 is a decisive advantage.

Architecture of cgroups v2

Hierarchy and Controllers

At its core, a cgroup v2 hierarchy is a single mount point, typically /sys/fs/cgroup. Under that mount you create directories that represent groups of processes. Each directory exposes interface files for the controllers enabled on it, such as cpu.max, memory.max, and io.max. Controllers (cpu, memory, io, and so on) are not selected at mount time; a parent cgroup enables them for its children by writing to its own cgroup.subtree_control file.

# Mount cgroup2 (requires root; systemd normally does this at boot)
mount -t cgroup2 none /sys/fs/cgroup
# Enable the three most common controllers for children of the root cgroup
echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control

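The kernel enforces limits per cgroup, not per process. Writing a PID to cgroup.procs moves that process together with all of its threads, so a whole service can be placed in a single isolated bucket with one write per process; children forked afterwards start in the same cgroup.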
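
The steps above (check which controllers are available, enable them for children, create the child directory, move a process) can be sketched in Python. This is a minimal illustration, not a replacement for systemd's own management: the function names `create_cgroup` and `move_process` are hypothetical, and on a real host the calls need root and a parent cgroup that you are allowed to manage.

```python
from pathlib import Path


def create_cgroup(parent: Path, name: str, controllers: list[str]) -> Path:
    """Enable `controllers` for children of `parent`, then create the child."""
    # cgroup.controllers lists what this cgroup may hand down to its children.
    available = (parent / "cgroup.controllers").read_text().split()
    missing = [c for c in controllers if c not in available]
    if missing:
        raise ValueError(f"controllers not available in {parent}: {missing}")

    # Writing "+cpu +memory" to the parent's cgroup.subtree_control
    # switches those controllers on for every child directory.
    (parent / "cgroup.subtree_control").write_text(
        " ".join(f"+{c}" for c in controllers)
    )

    child = parent / name
    child.mkdir(exist_ok=True)
    return child


def move_process(cgroup: Path, pid: int) -> None:
    """Move one process (and all of its threads) into `cgroup`."""
    (cgroup / "cgroup.procs").write_text(str(pid))
```

Passing the parent as a `Path` keeps the sketch testable against a scratch directory before pointing it at /sys/fs/cgroup.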

Integration with systemd and Kubernetes

systemd became the de‑facto manager of cgroup v2 on most distributions. When you start a unit, systemd creates a subdirectory under /sys/fs/cgroup named after the unit (e.g., system.slice/nginx.service). You can set limits directly in the unit file:

# /etc/systemd/system/nginx.service.d/override.conf
[Service]
CPUQuota=250%
MemoryMax=2G
IOWeight=500
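
To see what these directives turn into at the cgroup level, the arithmetic can be sketched as follows. The helper names are hypothetical, and the sketch assumes systemd's default CPU quota period of 100 ms and its base-1024 size suffixes; it mirrors the conversion rather than reproducing systemd's code.

```python
# Size suffixes are interpreted with base 1024 (assumption stated above).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def cpu_quota_to_cpu_max(quota_percent: float, period_us: int = 100_000) -> str:
    """Translate CPUQuota=<percent>% into the "<quota> <period>" form of cpu.max."""
    quota_us = round(quota_percent / 100 * period_us)
    return f"{quota_us} {period_us}"


def size_to_bytes(value: str) -> int:
    """Translate a size such as "2G" into the byte count written to memory.max."""
    value = value.strip()
    if value[-1] in _UNITS:
        return int(float(value[:-1]) * _UNITS[value[-1]])
    return int(value)


print(cpu_quota_to_cpu_max(250))  # 250000 100000
print(size_to_bytes("2G"))        # 2147483648
```

A quota of 250% therefore means 250 ms of CPU time per 100 ms period, which is two and a half cores' worth.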

Kubernetes supports cgroup v2 natively (the feature reached general availability in 1.25), so pods run on v2 nodes without extra patches. The kubelet detects the host's cgroup version and, when it uses the systemd cgroup driver, creates its hierarchy under /sys/fs/cgroup/kubepods.slice. For deeper insight, see the cgroup v2 page in the Kubernetes documentation.

Implementation Strategies

1. Boot‑time Enablement

The safest way to guarantee that every process on the host participates in the v2 hierarchy is to enable it at boot. On most modern distros you can add a kernel command line parameter:

# /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1"

After updating GRUB (update-grub on Debian/Ubuntu, grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL) and rebooting, verify the filesystem type of the mount:

stat -fc %T /sys/fs/cgroup   # should output cgroup2fs
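
The same check can be scripted, for example as a pre-flight step in a provisioning job. The sketch below is an assumption-laden heuristic rather than an official API: it relies on cgroup.controllers existing only on a cgroup2 mount, and on systemd's convention of mounting the v2 tree at /sys/fs/cgroup/unified in hybrid mode. The function name `cgroup_mode` is hypothetical.

```python
from pathlib import Path


def cgroup_mode(root: Path = Path("/sys/fs/cgroup")) -> str:
    """Classify the host as "unified" (pure v2), "hybrid", or "legacy" (v1)."""
    # A cgroup2 mount always carries cgroup.controllers at its top level.
    if (root / "cgroup.controllers").is_file():
        return "unified"
    # In hybrid mode systemd mounts the v2 tree next to the v1 controllers.
    if (root / "unified" / "cgroup.controllers").is_file():
        return "hybrid"
    return "legacy"


if __name__ == "__main__":
    print(cgroup_mode())
```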

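If you need to keep legacy workloads that still rely on v1, you can run a hybrid setup in which only some controllers are taken away from v1 (for example cgroup_no_v1=cpu,memory). A controller can be attached to only one hierarchy at a time, so this splits resource control across two APIs and undermines consistent isolation; it’s best to migrate everything.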

2. Per‑service Resource Profiles

Instead of hard‑coding limits in unit files, define reusable profiles in a central directory and include them via systemd drop‑ins. This approach mirrors the “policy as code” mindset popular in SRE teams.

# /etc/systemd/cgroup-profiles/webapp.conf
[Service]
CPUQuota=150%
MemoryMax=1G
IOWeight=300
# optional: disable swap for this service (this does not prevent OOM kills)
MemorySwapMax=0

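Consume the profile by symlinking it into the unit's drop‑in directory. systemd has no Include= directive, but it merges every *.conf file it finds in <unit>.d/:

mkdir -p /etc/systemd/system/webapp.service.d
ln -s /etc/systemd/cgroup-profiles/webapp.conf \
  /etc/systemd/system/webapp.service.d/10-profile.conf

# /etc/systemd/system/webapp.service
[Unit]
Description=Web Application
After=network.target

[Service]
ExecStart=/usr/local/bin/webapp
# Resource limits come from webapp.service.d/10-profile.conf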

When you need to adjust the memory ceiling for an upcoming release, you edit a single file and reload systemd:

systemctl daemon-reload
systemctl restart webapp.service
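
If profiles are rolled out from a pipeline, attaching one to a unit can be automated. The following is a sketch under assumptions: it links the shared profile into the unit's drop-in directory, the file name 10-resource-profile.conf and the function name `link_profile` are made up for illustration, and you would still run systemctl daemon-reload afterwards.

```python
from pathlib import Path


def link_profile(
    profile: Path,
    unit: str,
    systemd_dir: Path = Path("/etc/systemd/system"),
) -> Path:
    """Symlink a shared resource profile into <unit>.d/ as a drop-in."""
    dropin_dir = systemd_dir / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)

    link = dropin_dir / "10-resource-profile.conf"
    # Replace an existing link so re-running the deployment is idempotent.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(profile)
    return link
```

Because every consumer points at the same file, changing the profile once changes the limits for each unit that links to it.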

3. Dynamic Scaling with `systemd-run`

For short‑lived jobs (batch processing, CI runners), you can create on‑the‑fly cgroups with systemd-run. This avoids the need to pre‑define a unit file.

systemd-run --unit=ci-job-$(date +%s) \
  --property=CPUQuota=200% \
  --property=MemoryMax=4G \
  --property=IOWeight=400 \
  /usr/local/bin/ci-runner.sh

Because systemd-run creates a transient unit, the cgroup disappears automatically when the process exits, ensuring no leftover resource reservations.
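
When jobs are launched from a scheduler rather than a shell, it helps to build the command as an argument list so that no quoting is involved. This is a small sketch; `systemd_run_argv` is a hypothetical helper, and it passes every limit through systemd-run's --property option.

```python
def systemd_run_argv(unit: str, command: list[str], **properties) -> list[str]:
    """Build a systemd-run command line with resource-control properties."""
    argv = ["systemd-run", f"--unit={unit}"]
    for key, value in properties.items():
        # Each keyword becomes --property=Key=value, e.g. MemoryMax=4G.
        argv.append(f"--property={key}={value}")
    argv.extend(command)
    return argv


argv = systemd_run_argv(
    "ci-job-1700000000",
    ["/usr/local/bin/ci-runner.sh"],
    CPUQuota="200%",
    MemoryMax="4G",
    IOWeight=400,
)
print(argv)
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a host where systemd-run is available.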

Patterns in Production

CPU & Memory Guarantees for Microservices

A common pattern is to give each microservice a soft CPU share (via cpu.weight) and a hard cap (cpu.max). The soft share allows the scheduler to prioritize busy services while the hard cap prevents any single pod from starving others.

# Example for a service that should never exceed 0.5 CPU cores
echo "50000 100000" > /sys/fs/cgroup/kubepods.slice/myservice.slice/cpu.max
# Soft weight of 200 (default is 100)
echo "200" > /sys/fs/cgroup/kubepods.slice/myservice.slice/cpu.weight

Memory limits are set with memory.max. Pair this with memory.swap.max=0 to disable swapping for latency‑sensitive services.
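
To reason about how the soft weight and the hard cap interact, a rough model helps: under full contention, siblings receive CPU time in proportion to their cpu.weight, and cpu.max then bounds the result. The sketch below encodes that simplified model; the sibling names and helper functions are hypothetical, and real scheduling also depends on which cgroups are actually runnable.

```python
from typing import Optional


def cpu_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of CPU each sibling gets when all of them are busy."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def cap_cores(cpu_max: str) -> Optional[float]:
    """Convert a cpu.max value ("<quota> <period>") into a core count."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None  # no hard cap configured
    return int(quota) / int(period)


# One service at weight 200 next to two siblings at the default of 100.
shares = cpu_shares({"myservice": 200, "sibling-a": 100, "sibling-b": 100})
print(shares["myservice"])          # 0.5
print(cap_cores("50000 100000"))    # 0.5
```

In this example the weight entitles the service to half of the contended CPU time, while the cap still limits it to half a core regardless of how idle the host is.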

I/O Throttling for Databases

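Databases like PostgreSQL often suffer from noisy‑neighbor I/O. Using io.max you can cap read and write bandwidth (bytes per second) or IOPS per service. Each line is keyed by the block device's major:minor number.

# Limit PostgreSQL to 100 MiB/s reads and 50 MiB/s writes on device 8:0
# (8:0 is an assumed example; look up your device with lsblk)
cat <<EOF > /sys/fs/cgroup/kubepods.slice/postgres.slice/io.max
8:0 rbps=104857600 wbps=52428800
EOF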

When combined with PSI metrics (see next section), you can detect when the database is throttled and trigger a scale‑out event.
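
Generating the io.max line programmatically avoids typos in the byte counts. The sketch below is illustrative only: `io_max_line` and `device_numbers` are hypothetical helpers, and it assumes the limit is applied to a whole-disk device node that you pass in.

```python
import os
from typing import Optional

MIB = 1024 * 1024


def device_numbers(dev_path: str) -> tuple[int, int]:
    """Return (major, minor) for a block device node such as /dev/sda."""
    st = os.stat(dev_path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def io_max_line(
    major: int,
    minor: int,
    rbps: Optional[int] = None,
    wbps: Optional[int] = None,
) -> str:
    """Format one io.max entry: "<major>:<minor> rbps=<n> wbps=<n>"."""
    parts = [f"{major}:{minor}"]
    if rbps is not None:
        parts.append(f"rbps={rbps}")
    if wbps is not None:
        parts.append(f"wbps={wbps}")
    return " ".join(parts)


print(io_max_line(8, 0, rbps=100 * MIB, wbps=50 * MIB))
# 8:0 rbps=104857600 wbps=52428800
```

Keys that are left out of the line keep their current setting, so a write-only limit can be expressed by passing just `wbps`.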

Using cgroup Pressure Metrics for Autoscaling

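Kernels with cgroups v2 expose Pressure Stall Information (PSI) system‑wide via /proc/pressure/<resource> and per cgroup via the cpu.pressure, memory.pressure, and io.pressure files. The some line gives the percentage of time at least one task was stalled on the resource; the full line gives the percentage of time all non‑idle tasks were stalled at once.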

# Sample PSI for CPU
cat /proc/pressure/cpu
# Output: some avg10=0.12 avg60=0.05 avg300=0.02 total=1234567

You can scrape these values with a tiny exporter:

# psi_exporter.py
import re
import time

def read_psi(resource):
    """Return the 10-second 'some' average (a percentage, 0-100)."""
    with open(f"/proc/pressure/{resource}") as f:
        data = f.read()
    m = re.search(r"some avg10=([\d.]+)", data)
    return float(m.group(1)) if m else 0.0

while True:
    cpu = read_psi("cpu")
    # Plain "name value" output; a real exporter would serve this over HTTP.
    print(f"cpu_pressure_10s {cpu}")
    time.sleep(10)

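Hook the metric into Prometheus, then scale a deployment on it, for example with an HPA backed by a custom‑metrics adapter that triggers when cpu_pressure_10s > 0.8. Keep in mind that avg10 is a percentage between 0 and 100, so 0.8 means the workload was stalled 0.8% of the time; pick a threshold that matches your latency budget.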
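
For per-service signals you would read the pressure file inside the service's cgroup rather than /proc/pressure. The sketch below parses the whole file, both the some and full lines, into a dictionary. `parse_psi` and `read_cgroup_psi` are hypothetical names, and the cgroup path you pass in is up to you.

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {"avg10": ..., "total": ...}, "full": {...}}."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        # Each field looks like "avg10=0.12"; split on the first "=" only.
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_cgroup_psi(cgroup: Path, resource: str) -> dict[str, dict[str, float]]:
    """Read <cgroup>/<resource>.pressure, e.g. cpu.pressure or io.pressure."""
    return parse_psi((cgroup / f"{resource}.pressure").read_text())


sample = (
    "some avg10=0.12 avg60=0.05 avg300=0.02 total=1234567\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])  # 0.12
```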

Monitoring & Observability

Exporting stats to Prometheus

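A cgroup‑aware exporter can read cpu.stat, memory.stat, io.stat, and the PSI files directly from the host's cgroup filesystem. The manifest below deploys one as a DaemonSet; treat the image name as a placeholder and substitute the exporter you actually run (cAdvisor and node_exporter's pressure collector are common choices):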

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cgroup-exporter
spec:
  selector:
    matchLabels:
      name: cgroup-exporter
  template:
    metadata:
      labels:
        name: cgroup-exporter
    spec:
      containers:
      - name: exporter
        image: quay.io/prometheus/cgroup-exporter:v0.2.0
        ports:
        - containerPort: 9102
          name: metrics
        volumeMounts:
        - name: cgroup
          mountPath: /sys/fs/cgroup
      volumes:
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup

Once scraped, you can build dashboards that show per‑service memory usage (memory.current), CPU time (usage_usec in cpu.stat), and PSI percentages.
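
If you want to spot-check what an exporter reports, the underlying files are easy to read by hand. cpu.stat and memory.stat use a flat "key value" layout, which the sketch below turns into a dictionary; the helper names are hypothetical, and io.stat uses a different, per-device layout that this parser does not cover.

```python
from pathlib import Path


def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse "key value" lines, as found in cpu.stat and memory.stat."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def cpu_seconds(cgroup: Path) -> float:
    """Total CPU time consumed by the cgroup, in seconds."""
    stats = parse_flat_keyed((cgroup / "cpu.stat").read_text())
    return stats["usage_usec"] / 1_000_000


sample = "usage_usec 1500000\nuser_usec 1000000\nsystem_usec 500000\n"
print(parse_flat_keyed(sample)["usage_usec"])  # 1500000
```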

Alerting on pressure stalls

A simple Prometheus rule:

# alerts.yml
groups:
- name: cgroup-psi
  rules:
  - alert: CpuPressureHigh
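    # Assumes node_exporter's pressure collector, which exposes stall time as a
    # counter in seconds; rate() turns it into the fraction of time stalled.
    expr: rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.75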
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "CPU pressure above 75%"
      description: "Node {{ $labels.instance }} is experiencing high CPU contention."

When this fires, your on‑call engineer can inspect the offending cgroup hierarchy and decide whether to add capacity or tighten limits.
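
During an investigation it can be useful to reproduce the alert's arithmetic by hand for one cgroup. The total field in a PSI file is cumulative stall time in microseconds, so two samples give the stalled fraction over the interval. The sketch below shows that calculation; the function names and the 0.75 default are illustrative, taken from the rule above rather than from any tool.

```python
def stall_fraction(total_before_us: int, total_after_us: int, elapsed_s: float) -> float:
    """Fraction of the interval spent stalled, from two PSI `total` samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (total_after_us - total_before_us) / (elapsed_s * 1_000_000)


def pressure_is_high(fraction: float, threshold: float = 0.75) -> bool:
    """Mirror the alert condition: stalled for more than `threshold` of the time."""
    return fraction > threshold


# 3 s of accumulated stall time over a 4 s window.
fraction = stall_fraction(1_000_000, 4_000_000, 4.0)
print(fraction)                    # 0.75
print(pressure_is_high(fraction))  # False, the rule requires strictly more
```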

Key Takeaways

Enable cgroups v2 at boot (systemd.unified_cgroup_hierarchy=1) to guarantee a single, predictable hierarchy for all processes.
Use systemd drop‑in profiles to centralise CPU, memory, and I/O limits; this makes policy changes auditable and repeatable.
Leverage systemd-run for transient jobs that need isolation without permanent unit files.
Combine hard caps (cpu.max, memory.max, io.max) with soft weights (cpu.weight, io.weight) to balance fairness and burstability.
Export PSI metrics to Prometheus and use them as first‑class signals for autoscaling or alerting.
Keep observability close to the cgroup layer; the kernel already provides per‑cgroup stats that are far more accurate than container‑runtime approximations.
