Implementing Cgroups v2 Resource Isolation: Control Groups, Unified Hierarchy, and Production Management Strategies

TL;DR — cgroups v2 replaces the fragmented v1 tree with a single unified hierarchy, giving you precise CPU, memory, and I/O controls. By wiring those controllers into systemd or a container runtime and adding automated enforcement, you can keep production workloads predictable and safe at scale.

Resource isolation is no longer a nice‑to‑have feature; it’s a hard requirement for any modern Linux service that runs dozens or hundreds of containers, batch jobs, or micro‑services on shared hardware. The second generation of control groups—cgroups v2—delivers a cleaner API, tighter integration with systemd, and a unified hierarchy that eliminates many of the edge‑case bugs that plagued the legacy v1 implementation. This post walks you through the core concepts, shows concrete commands, and then scales the discussion up to production‑grade architecture and operational patterns.

Understanding Cgroups v2 Fundamentals

cgroups (short for control groups) are a kernel feature that groups processes and applies resource limits, accounting, and isolation policies to the entire group. The v2 redesign, declared production-ready in Linux 4.5 (released in March 2016), addresses three pain points of v1:

Fragmented hierarchies – each controller (cpu, memory, blkio, …) could have its own tree, leading to inconsistent enforcement.
Inconsistent APIs – different controllers exposed different files and semantics, making automation error‑prone.
Limited nesting – v1 allowed some nesting but not in a predictable way, especially when mixing containers and system services.

In v2 there is one unified hierarchy rooted at /sys/fs/cgroup. All controllers attach to the same tree, and a single directory can host multiple resource limits. The kernel enforces that a child cannot exceed the limits of its parent, guaranteeing a natural “budget” inheritance model.
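
Before relying on the unified hierarchy, it helps to confirm that the host actually mounts it. The sketch below assumes GNU coreutils `stat`, which reports `cgroup2fs` for a cgroup v2 mount; the function name `cgroup_mode` is made up for this example.

```shell
#!/bin/sh
# Map a filesystem type name to the cgroup mode it implies.
# On v1 and hybrid hosts /sys/fs/cgroup is normally a tmpfs that holds
# one mount per controller, so "tmpfs" is treated as "not pure v2".
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1-or-hybrid" ;;
    *)         echo "unknown" ;;
  esac
}

# On a live host: ask stat for the filesystem type of the mount point.
cgroup_mode "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

If this prints anything other than v2, the file names used in the rest of this post will not be present under /sys/fs/cgroup.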

Core Concepts

Concept	Description
Unified hierarchy	A single tree where every controller is mounted.
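Controller	A kernel subsystem that enforces limits for a specific resource type (e.g., `cpu`, `memory`, `io`).
cgroup.procs	File listing the PIDs belonging to the cgroup; writing a PID moves that process into the cgroup.
cgroup.subtree_control	Lists the controllers enabled for child cgroups; write `+name` or `-name` to change it.
cgroup.events	Reports state changes through the `populated` and `frozen` keys; resource pressure is exposed separately in `cpu.pressure`, `memory.pressure`, and `io.pressure`.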

The unified approach lets you treat a cgroup like a resource container—you can hand it off to systemd, Kubernetes, or a custom orchestrator without worrying about mismatched hierarchies.

Unified Hierarchy vs. Legacy v1

Feature	cgroups v1	cgroups v2
Mount point	One per controller (`/sys/fs/cgroup/cpu`, `/sys/fs/cgroup/memory`, …)	Single mount (`/sys/fs/cgroup`)
Controller enable	Per‑tree via mount options	`cgroup.subtree_control` file
Nested limits	Inconsistent; often required manual coordination	Inherited automatically; child cannot exceed parent
API surface	Many files, each controller with its own syntax	Consistent file names (`cpu.max`, `memory.max`, `io.max`)
Integration	Ad‑hoc scripts, limited systemd support	Native systemd slice support, better container runtimes

Because most production teams already rely on systemd for service management, the v2 model aligns perfectly with systemd slices (system.slice, user.slice, etc.). When you create a slice, systemd automatically creates a matching cgroup under the unified hierarchy and populates cgroup.subtree_control with the controllers you request.

Example: Enabling Controllers in a Subtree

# Assume we are at /sys/fs/cgroup
mkdir myapp
cd myapp

# Enable cpu, memory, and io for this subtree
echo "+cpu +memory +io" > cgroup.subtree_control

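The + prefix adds controllers to the subtree, making them available to child cgroups; a - prefix removes them. A controller can only be enabled here if it appears in this cgroup’s cgroup.controllers file, which means the parent must have enabled it first. If you forget to enable a controller, its interface files (cpu.max, memory.max, and so on) are simply absent from the child cgroups, so there is nothing to write to. Also note that a non-root cgroup with controllers enabled in cgroup.subtree_control cannot hold processes itself; they must live in its children.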
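
A quick way to spot a controller that is available but not yet enabled is to compare cgroup.controllers with cgroup.subtree_control. The helper below is a minimal sketch; the name `disabled_controllers` is hypothetical, and it only reads files, so it can be pointed at any cgroup directory.

```shell
#!/bin/sh
# Print controllers that are available in a cgroup but not enabled
# for its children. $1 is the cgroup directory.
disabled_controllers() {
  dir=$1
  # Pad with spaces so each name can be matched as a whole word.
  enabled=" $(cat "$dir/cgroup.subtree_control") "
  for ctrl in $(cat "$dir/cgroup.controllers"); do
    case "$enabled" in
      *" $ctrl "*) ;;                  # already enabled for children
      *) printf '%s\n' "$ctrl" ;;      # available but switched off
    esac
  done
}

# Example (on a cgroup v2 host):
#   disabled_controllers /sys/fs/cgroup/myapp
```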

Resource Controllers in Practice

Below we cover the three most common controllers used in production workloads: CPU, memory, and I/O. Each controller exposes a small set of files that can be read or written to adjust limits.

CPU Controller

The CPU controller in v2 uses a bandwidth model (cpu.max) and a weight model (cpu.weight). Bandwidth defines a hard quota per period; weight defines the relative share a group receives when CPUs are contended.

# Set a hard limit of 200ms of CPU time every 1 second (200ms/1s = 20% of a core)
echo "200000 1000000" > cpu.max

# Give the group a weight of 200 (default is 100, range 1‑10000)
echo "200" > cpu.weight

Both values in cpu.max are in microseconds: the quota first, then the period. When a group exhausts its quota, the kernel throttles it until the next period begins; the nr_throttled and throttled_usec counters in cpu.stat show how often that happens.
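
Since cpu.max is just quota and period in microseconds, a budget expressed in millicores converts with one line of integer arithmetic. This is a sketch; `cpu_max_line` is an illustrative name, and the default period of 100000 µs is the kernel’s default for cpu.max.

```shell
#!/bin/sh
# Convert a CPU budget in millicores into a cpu.max line.
# $1 = millicores (1000 = one full core)
# $2 = period in microseconds (optional, defaults to 100000)
cpu_max_line() {
  millicores=$1
  period=${2:-100000}
  quota=$(( millicores * period / 1000 ))
  echo "$quota $period"
}

cpu_max_line 200 1000000   # prints "200000 1000000" (20% of a core)
cpu_max_line 1500          # prints "150000 100000" (1.5 cores)
```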

Memory Controller

Memory isolation is achieved via memory.max (hard limit) and memory.high (a throttling threshold, often described as a soft limit). Crossing memory.high slows allocations and triggers aggressive reclaim, but it never invokes the OOM killer.

# Hard limit of 2 GiB
echo "2G" > memory.max

# Soft limit of 1.5 GiB (the kernel rejects fractional values, so write it in MiB);
# the kernel will start reclaiming before hitting the hard limit
echo "1536M" > memory.high

When usage reaches memory.max and the kernel cannot reclaim enough memory, the OOM killer is invoked inside the cgroup; allocations do not simply fail with ENOMEM. Tools like systemd-cgtop show memory usage per slice.
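
The memory controller also keeps a memory.events file with counters such as high, max, oom, and oom_kill, which tells you whether a limit was actually hit. A small sketch for reading one counter follows; `memory_event_count` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the counter for one key (low, high, max, oom, oom_kill) from a
# cgroup's memory.events file. $1 = cgroup directory, $2 = key.
memory_event_count() {
  # Exact match on the first field, so "oom" does not match "oom_kill".
  awk -v key="$2" '$1 == key { print $2 }' "$1/memory.events"
}

# Example (on a cgroup v2 host):
#   memory_event_count /sys/fs/cgroup/myapp oom_kill
```

A rising high counter with no oom_kill events suggests the throttling threshold is doing its job; a rising oom_kill counter means the hard limit is too tight for the workload.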

I/O Controller (io, formerly blkio in v1)

The I/O controller works with throttle (io.max) and weight (io.weight). You specify a device major:minor pair followed by limits.

# Limit reads to 5 MiB/s and writes to 2 MiB/s on /dev/sda (8:0); values are plain bytes per second
echo "8:0 rbps=5242880 wbps=2097152" > io.max

# Set the weight for the device (default 100, range 1-10000)
echo "8:0 300" > io.weight

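Container runtimes place each container in its own cgroup, so the same files can be used to apply limits per container. Entries in io.max name a block device by its major:minor number (lsblk prints it in the MAJ:MIN column), so look up the device that backs the container’s storage first. Note that io.weight only has an effect when a proportional I/O policy, such as the BFQ scheduler or io.cost, is active on the device.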
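
The kernel documentation’s io.max examples use plain byte counts, so it is safer to compute the numbers than to rely on size suffixes. The sketch below assumes limits are given in MiB/s; `io_max_line` is an illustrative name.

```shell
#!/bin/sh
# Build an io.max line from MiB/s figures.
# $1 = device as MAJ:MIN, $2 = read limit in MiB/s, $3 = write limit in MiB/s
io_max_line() {
  echo "$1 rbps=$(( $2 * 1048576 )) wbps=$(( $3 * 1048576 ))"
}

io_max_line 8:0 5 2   # prints "8:0 rbps=5242880 wbps=2097152"

# The device number can be looked up with, for example:
#   lsblk -dno MAJ:MIN /dev/sda
```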

Architecture: Integrating Cgroups v2 into Container Orchestration

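Most production environments run containers under an orchestrator such as Kubernetes, Docker Swarm, or Nomad. These platforms originally targeted cgroups v1; current releases detect the host’s cgroup version and use v2 when the node boots with the unified hierarchy (Kubernetes support has been stable since v1.25). The cgroup version is therefore determined by how the node is set up rather than by anything in the workload definition.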

Systemd‑Managed Pods

Kubernetes on a systemd-based host can use the systemd cgroup driver. With this driver the kubelet creates kubepods.slice as the root for all pods, QoS-class slices beneath it for burstable and best-effort pods (kubepods-burstable.slice, kubepods-besteffort.slice), a slice per pod inside the matching QoS slice, and a scope per container inside the pod slice. The cgroup tree mirrors the pod-to-container relationship, making resource enforcement declarative.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: cgroupsv2
handler: runc
overhead:
  podFixed:
    cpu: "500m"
    memory: "256Mi"

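A RuntimeClass like the one above only selects the runtime handler and declares per-pod overhead; it does not switch the cgroup version or driver. The driver is chosen on the node: set cgroupDriver: systemd in the kubelet configuration (or pass the older --cgroup-driver=systemd flag) and enable the matching option in the container runtime, for example SystemdCgroup = true for containerd’s runc handler. The runtime then creates the pod’s cgroups under /sys/fs/cgroup/kubepods.slice and translates spec.containers[].resources.limits into cpu.max and memory.max automatically. Kubernetes has no built-in block I/O limit, so io.max is left for you to manage.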
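
CPU requests, as opposed to limits, end up in cpu.weight. The sketch below reproduces the arithmetic as I understand it: the kubelet derives v1-style shares from millicores (1024 per core, minimum 2), and runc has historically mapped shares linearly onto the 1 to 10000 weight range. Treat the second formula as an assumption; runtime versions may use a different mapping.

```shell
#!/bin/sh
# CPU request in millicores -> v1-style cpu.shares (1024 per core, minimum 2).
millicores_to_shares() {
  shares=$(( $1 * 1024 / 1000 ))
  if [ "$shares" -lt 2 ]; then
    shares=2
  fi
  echo "$shares"
}

# cpu.shares (2..262144) -> cpu.weight (1..10000), linear mapping.
# Assumption: this mirrors the conversion historically used by runc.
shares_to_weight() {
  echo $(( 1 + ( ($1 - 2) * 9999 ) / 262142 ))
}

shares_to_weight "$(millicores_to_shares 1000)"   # prints 39
shares_to_weight "$(millicores_to_shares 500)"    # prints 20
```

Under this mapping a one-core request lands well below the default weight of 100, which is worth knowing when pods share a node with plain systemd services.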

Direct Interaction via OCI Hooks

If you need finer‑grained control—say, a per‑tenant I/O quota that the orchestrator doesn’t expose—you can inject an OCI hook that runs after container creation but before the process starts. The hook can write directly to the cgroup files.

{
  "ociVersion": "1.0.2",
  "hooks": {
    "prestart": [
      {
        "path": "/usr/local/bin/set-cgroup-limits.sh",
        "args": ["set-cgroup-limits.sh", "cpu.max=50000 100000"],
        "env": []
      }
    ]
  }
}

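The script set-cgroup-limits.sh receives the container’s state as JSON on standard input, as defined by the OCI runtime specification. It can read the pid field from that state, look up the container’s cgroup in /proc/<pid>/cgroup, and write the limits there. Newer versions of the specification deprecate prestart in favor of createRuntime, which runs at a similar point. This pattern suits workloads with strict latency requirements that need limits the orchestrator cannot express.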
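
A hook script along these lines might start from two small helpers. This is a sketch under assumptions: the OCI runtime specification delivers the container state as JSON on stdin, the helper names are hypothetical, and the sed-based JSON parsing is deliberately naive (a real script would likely use jq).

```shell
#!/bin/sh
# Extract the numeric "pid" field from OCI state JSON on stdin.
# Naive parsing: assumes "pid" appears once, as a top-level number.
state_pid() {
  tr -d '\n' | sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the cgroup v2 path from a /proc/<pid>/cgroup file.
# On a unified hierarchy the file holds a single "0::<path>" line.
cgroup_v2_path() {
  sed -n 's/^0:://p' "$1"
}

# Inside the hook, the pieces would be combined roughly like this:
#   pid=$(state_pid)
#   cg="/sys/fs/cgroup$(cgroup_v2_path "/proc/$pid/cgroup")"
#   echo "50000 100000" > "$cg/cpu.max"
```

The path in /proc/<pid>/cgroup is relative to the reader’s cgroup namespace, so this works as written only when the hook runs in the host’s namespace.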

Monitoring Integration

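Production teams need visibility into cgroup pressure to avoid silent throttling. On kernels built with PSI support (4.20 and later), each cgroup publishes pressure stall information via cpu.pressure, memory.pressure, and io.pressure. systemd-cgtop shows per-cgroup CPU, memory, and I/O usage but not PSI, so the pressure files are usually read directly or through an exporter; systemd-oomd is one consumer that acts on memory pressure.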

# Example: read CPU pressure (average over 10s, 60s, 300s)
cat cpu.pressure
# Output: some avg10=0.00 avg60=0.01 avg300=0.02 total=12345

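Exporting these values to Prometheus takes a little work: node_exporter’s pressure collector covers only the system-wide figures in /proc/pressure, so per-cgroup values need a custom exporter that reads the *.pressure files and exposes them as gauges for Prometheus to scrape.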
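
As a starting point for such an exporter, the function below turns one pressure file into Prometheus text-format samples, which could be written to a file for node_exporter’s textfile collector. The metric name `cgroup_pressure_percent` and its labels are invented for this sketch.

```shell
#!/bin/sh
# Emit one gauge sample per PSI line ("some"/"full") and averaging window.
# $1 = pressure file, $2 = resource label (cpu, memory, io), $3 = cgroup label
psi_to_prom() {
  awk -v res="$2" -v cg="$3" '
    $1 == "some" || $1 == "full" {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] ~ /^avg/) {
          # kv[1] is e.g. "avg60"; strip "avg" to get the window in seconds.
          printf "cgroup_pressure_percent{cgroup=\"%s\",resource=\"%s\",kind=\"%s\",window=\"%s\"} %s\n",
                 cg, res, $1, substr(kv[1], 4), kv[2]
        }
      }
    }' "$1"
}

# Example (on a cgroup v2 host with PSI enabled):
#   psi_to_prom /sys/fs/cgroup/kubepods.slice/cpu.pressure cpu kubepods
```

The total= field is skipped here; it is a cumulative counter in microseconds and would be exported as a counter rather than a gauge.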

Patterns in Production: Monitoring, Enforcement, and Scaling

Theoretical knowledge is only half the battle; production success hinges on repeatable patterns. Below are three proven strategies.

1. Hierarchical Budgeting

Create a top-level “budget” cgroup for each tenant, team, or environment (dev, staging, prod). All workloads descend from this node, inheriting its limits. When a tenant exceeds its budget, you have three options:

Throttle: Reduce cpu.max temporarily.
Evict: Move low‑priority containers into a “burst” cgroup with relaxed limits.
Alert: Emit a PagerDuty incident based on PSI thresholds.

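# Tenant budget cgroup (controllers must be enabled at every level above it)
mkdir -p /sys/fs/cgroup/tenants/acme
echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu +memory +io" > /sys/fs/cgroup/tenants/cgroup.subtree_control
cd /sys/fs/cgroup/tenants/acme
echo "+cpu +memory +io" > cgroup.subtree_control
echo "4G" > memory.max
echo "2G" > memory.high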

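All services belonging to ACME are launched in child cgroups beneath this one (a cgroup that enables controllers for its children cannot hold processes itself), ensuring that together they never exceed the agreed quota.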
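
The kernel does not stop you from configuring children whose limits add up to more than the parent’s; the parent’s limit simply wins at enforcement time. A small audit helper can make that oversubscription visible. This is a sketch, and `children_memory_max` is a hypothetical name.

```shell
#!/bin/sh
# Sum memory.max across the direct children of a cgroup, in bytes.
# Prints "unlimited" as soon as a child is set to "max".
# $1 = parent cgroup directory
children_memory_max() {
  total=0
  for f in "$1"/*/memory.max; do
    [ -f "$f" ] || continue        # no children: the glob stays unexpanded
    v=$(cat "$f")
    if [ "$v" = "max" ]; then
      echo "unlimited"
      return 0
    fi
    total=$(( total + v ))
  done
  echo "$total"
}

# Example (on a cgroup v2 host):
#   children_memory_max /sys/fs/cgroup/tenants/acme
```

Comparing the result with the parent’s own memory.max shows at a glance whether a tenant’s services are overcommitted against the budget.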

2. Auto‑Scaling with Pressure Feedback

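Instead of static limits, use PSI as a trigger for scaling decisions. For example, when the avg60 value on the some line of memory.pressure exceeds 10 (PSI averages are percentages, so this means tasks were stalled on memory for more than 10 % of the last minute), spin up an additional replica.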

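while true; do
  # avg60 is the third field of the "some" line, e.g. "avg60=12.34" (a percentage)
  pressure=$(awk '/^some/ {split($3, kv, "="); print kv[2]}' /sys/fs/cgroup/kubepods.slice/memory.pressure)
  if (( $(echo "$pressure > 10" | bc -l) )); then
    kubectl scale deployment web --replicas=5
  fi
  sleep 30
done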

This loop can run as a systemd service with Restart=always to ensure resilience.
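
A fixed replica count is the simplest possible reaction. One possible refinement is to step up by one replica at a time and respect a ceiling, so a noisy reading cannot trigger a large jump. The helper below sketches that decision only; the name `next_replicas` and its parameters are assumptions, and the threshold is a PSI percentage.

```shell
#!/bin/sh
# Decide the next replica count from a PSI avg60 reading.
# $1 = avg60 percentage (e.g. 12.34)   $2 = threshold percentage
# $3 = current replicas                $4 = maximum replicas
next_replicas() {
  awk -v p="$1" -v t="$2" -v cur="$3" -v max="$4" 'BEGIN {
    n = cur + 0
    # Add one replica only when pressure is above the threshold
    # and the ceiling has not been reached.
    if (p + 0 > t + 0 && n < max + 0) n = n + 1
    print n
  }'
}

next_replicas 12.34 10 3 5   # prints 4
next_replicas 2.00 10 3 5    # prints 3
```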

3. Auditable Configuration as Code

Store cgroup configurations in Git alongside your Helm charts or Terraform modules. Use a CI step that validates the values (for example, a small script that checks the cpu.max and memory.max formats) before applying changes. Example Terraform snippet:

resource "null_resource" "cgroup_limits" {
  # Re-run the provisioner whenever one of the limits changes
  triggers = {
    cpu_quota = var.cpu_quota
    mem_limit = var.mem_limit
  }

  provisioner "local-exec" {
    command = <<-EOT
      cgpath="/sys/fs/cgroup/kubepods.slice/${var.namespace}"
      mkdir -p "$cgpath"
      echo "+cpu +memory +io" > "$cgpath/cgroup.subtree_control"
      echo "${var.cpu_quota}" > "$cgpath/cpu.max"
      echo "${var.mem_limit}" > "$cgpath/memory.max"
    EOT
  }
}

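By treating cgroup tweaks as code, you get version history, peer review, and rollback capabilities. Keep in mind that local-exec runs on the machine that executes Terraform, so this snippet only works when Terraform runs on the target node, and that cgroups under kubepods.slice are also managed by the kubelet and systemd, which may overwrite manual changes.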
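
A CI validation step for these values can be as small as two pattern checks. The sketch below accepts the forms described in the kernel’s cgroup v2 documentation for cpu.max ("max" or a quota, optionally followed by a period) and a conservative subset for memory.max (whole numbers with an optional K, M, G, or T suffix). The function names are hypothetical.

```shell
#!/bin/sh
# Return 0 if $1 looks like a valid cpu.max value:
# "<quota|max>" optionally followed by " <period>", all in microseconds.
valid_cpu_max() {
  printf '%s\n' "$1" | grep -Eq '^(max|[0-9]+)( [0-9]+)?$'
}

# Return 0 if $1 looks like a valid memory.max value:
# "max" or a whole number with an optional K/M/G/T suffix (no fractions).
valid_memory_max() {
  printf '%s\n' "$1" | grep -Eq '^(max|[0-9]+[KMGTkmgt]?)$'
}

# Example CI usage:
#   valid_cpu_max "$CPU_QUOTA"    || { echo "bad cpu.max" >&2; exit 1; }
#   valid_memory_max "$MEM_LIMIT" || { echo "bad memory.max" >&2; exit 1; }
```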

Key Takeaways

Unified hierarchy eliminates the fragmented trees of v1, giving you a single source of truth for all resource controllers.
Controller files (cpu.max, memory.max, io.max) follow a consistent, easily scripted syntax: cpu.max takes microseconds, memory.max takes bytes, and io.max takes bytes or I/O operations per second for each device.
Systemd integration allows you to model resources as slices, aligning with existing service management practices.
Production patterns such as hierarchical budgeting, pressure‑driven auto‑scaling, and IaC‑driven cgroup configuration turn raw kernel features into reliable, observable services.
Monitoring PSI (*.pressure) shows how much time tasks spend stalled on CPU, memory, or I/O, providing early warning of contention before users notice latency spikes and enabling proactive remediation.
