TL;DR — eBPF lets you instrument the Linux kernel without kernel patches, providing low‑overhead, system‑wide tracing that scales to production. By combining BPFtrace, Cilium Hubble, and the BCC toolchain you can build a reusable observability stack that captures latency, resource usage, and security events safely and efficiently.
Observability is no longer a luxury reserved for micro‑service clouds; modern monoliths, edge devices, and hybrid workloads need the same depth of insight. Traditional tools (strace, perf, sysdig) either impose too much overhead or require invasive kernel changes. eBPF fills that gap with programmable, JIT‑compiled probes that run in kernel space, yet are verified for safety before execution. This article walks you through the architecture, production patterns, and hands‑on snippets you need to turn eBPF from a curiosity into a core component of your observability platform.
Why eBPF Matters for Production Observability
A performance‑first alternative to kernel patches
- Zero‑downtime deployment – eBPF programs are loaded at runtime via bpftool or higher‑level libraries, meaning you never need to reboot or rebuild the kernel.
- Microsecond‑level overhead – Benchmarks from the Cilium team show <1 µs per probe on idle workloads, compared to 10‑100 µs for perf events.
- Safety guarantees – The kernel verifier checks each program for illegal memory accesses, infinite loops, and stack overflows before it ever runs.
Real‑world adoption
Companies like Netflix, Stripe, and Cloudflare have publicly shared their eBPF‑based telemetry pipelines. Netflix, for example, uses eBPF to collect per‑request latency across its entire stack without adding latency to the request path, as described in their engineering blog [1].
Core eBPF Building Blocks
| Component | Typical Use | Example Tool |
|---|---|---|
| BPF maps | Shared state between kernel and user space | hash, array, perf_event_array |
| kprobes / uprobes | Attach to kernel or user‑space functions | bpftrace -e 'kprobe:do_sys_open { ... }' |
| tracepoints | Low‑overhead, stable hooks | tracepoint:sched:sched_process_exit |
| XDP (eXpress Data Path) | Packet‑level processing at NIC driver level | Cilium, XDP‑based firewalls |
| Perf events | High‑resolution perf counters | perf record -e cycles:k |
Below is a minimal BCC program that counts execve syscalls per PID:
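# execve_counter.py
import time

from bcc import BPF

bpf_text = """
BPF_HASH(cnt, u32, u64);

int count_execve(struct pt_regs *ctx) {
    // The upper 32 bits hold the process ID (tgid).
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0;
    u64 *val = cnt.lookup_or_try_init(&pid, &zero);
    if (val)  // NULL when the entry could not be created, e.g. the map is full
        (*val)++;
    return 0;
}
"""

b = BPF(text=bpf_text)
# Resolve the architecture-specific symbol (for example __x64_sys_execve on x86-64).
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="count_execve")

print("Tracing execve... Hit Ctrl-C to end.")
try:
    # The probe only updates the map, so simply wait until interrupted.
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass

# Print map contents
for k, v in b["cnt"].items():
    print(f"PID {k.value}: {v.value} execve calls")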
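The script compiles the C snippet, attaches a kprobe to the kernel's execve syscall handler, and updates a hash map in the kernel. The verifier checks the program for unsafe memory access before it is loaded, so a bug in the probe cannot crash the kernel; if the map fills up, new PIDs simply go uncounted.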
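The `>> 32` shift in the probe relies on how `bpf_get_current_pid_tgid()` packs its result: the process ID (tgid) sits in the upper 32 bits and the thread ID in the lower 32. A minimal pure-Python sketch of that layout follows; the function name and the sample IDs are illustrative, not part of BCC.

```python
from typing import Tuple


def split_pid_tgid(pid_tgid: int) -> Tuple[int, int]:
    """Split the packed 64-bit helper result into (tgid, tid)."""
    tgid = (pid_tgid >> 32) & 0xFFFFFFFF  # process ID as user space sees it
    tid = pid_tgid & 0xFFFFFFFF           # thread ID (the kernel's "pid")
    return tgid, tid


# Hypothetical values: thread 4325 belonging to process 4321.
packed = (4321 << 32) | 4325
print(split_pid_tgid(packed))
```

Keying the map on the upper half, as the probe does, groups all threads of a process under one counter.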
Architecture: Deploying System‑Wide Tracing with BPFtrace and Cilium
Below is a production‑grade diagram (textual representation) that shows how eBPF fits into a typical observability pipeline:
+-------------------+      +-------------------+      +-------------------+
|    Application    | ---> |   Cilium Hubble   | ---> |    Prometheus     |
|    (container)    |      |    (XDP + BPF)    |      |     (scrape)      |
+-------------------+      +-------------------+      +-------------------+
          ^                          ^                          ^
          |                          |                          |
          |  BPFtrace agents on      |  BPF maps exported via   |
          |  each node (daemon)      |  /sys/fs/bpf/*           |
          +--------------------------+--------------------------+
1. Edge collection with BPFtrace agents
Deploy a lightweight bpftrace daemon on every node (as a DaemonSet in Kubernetes). Each daemon runs a curated set of scripts:
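#!/usr/bin/env bash
# /usr/local/bin/trace-agent.sh
set -euo pipefail
# Histogram of nanosleep() durations (ns) for processes named "go".
# This is a stand-in: real HTTP handler latency needs uprobes on the server binary.
bpftrace -e '
tracepoint:syscalls:sys_enter_nanosleep /comm == "go"/ {
  @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_nanosleep /@start[tid]/ {
  @latency[pid] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}
' -o /var/run/bpftrace/http_latency.bt &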
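The -o flag redirects bpftrace's text output, including the histograms it prints on exit, to the given file. bpftrace does not pin its maps under /sys/fs/bpf/..., so a node‑local exporter has to parse that output (or the probe has to be rewritten as a program that pins its maps) before Prometheus can scrape the values.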
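Whatever ends up consuming the agent's output has to turn bpftrace's textual histograms into numbers. The sketch below parses rows of the form `[1K, 2K)   20 |@@@@|`, which is how bpftrace prints `hist()` maps; the sample text, the function name, and the limited set of suffixes handled (K, M, G as powers of 1024) are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Multipliers for the size suffixes bpftrace uses in bucket labels.
_SUFFIX = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
# Matches "[0]  5 |...|" as well as "[512, 1K)  3 |...|".
_ROW = re.compile(r"^\[(\d+)([KMG]?)(?:, (\d+)([KMG]?)\))?\]?\s+(\d+)\s+\|")


def parse_hist(text: str) -> List[Tuple[int, int, int]]:
    """Return (low, high_exclusive, count) for every histogram row found."""
    rows = []
    for line in text.splitlines():
        m = _ROW.match(line.strip())
        if not m:
            continue  # map headers and blank lines are skipped
        low = int(m.group(1)) * _SUFFIX[m.group(2)]
        if m.group(3) is None:
            high = low + 1  # single-value bucket such as "[0]"
        else:
            high = int(m.group(3)) * _SUFFIX[m.group(4)]
        rows.append((low, high, int(m.group(5))))
    return rows


# Hypothetical output for one PID.
SAMPLE = """
@latency[4321]:
[512, 1K)              3 |@@@                 |
[1K, 2K)              20 |@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)               7 |@@@@@@@             |
"""
print(parse_hist(SAMPLE))
```

Parsing text is fragile across bpftrace versions; newer releases also offer JSON output (`-f json`), which is worth preferring where available.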
2. Cilium Hubble for networking observability
Cilium attaches eBPF programs to each node's network devices (primarily at the tc hook, with XDP used for selected fast paths such as load balancing) to capture flow metadata. Hubble aggregates these events and exposes them as Prometheus metrics. Combining Cilium's network‑level view with BPFtrace's syscall‑level view gives you full‑stack visibility.
3. Central aggregation
Prometheus pulls metrics from the exporter, Grafana visualises them, and Alertmanager fires alerts when latency histograms cross thresholds.
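To make "latency histograms cross thresholds" concrete, here is a rough sketch of the arithmetic an alert rule performs on bucketed data: find the bucket boundary below which a given percentage of samples fall, then compare it with a limit. The bucket values and the 3000 ns threshold are hypothetical, and a real deployment would express this as a PromQL rule rather than Python.

```python
from typing import List, Optional, Tuple

Bucket = Tuple[int, int, int]  # (low, high_exclusive, count)


def percentile_upper_bound(buckets: List[Bucket], pct: int) -> Optional[int]:
    """Upper edge of the first bucket at which pct percent of samples are covered."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return None
    seen = 0
    for _, high, count in sorted(buckets):
        seen += count
        if seen * 100 >= pct * total:  # integer math avoids float rounding
            return high
    return None


def should_alert(buckets: List[Bucket], pct: int, threshold_ns: int) -> bool:
    """True when the estimated percentile exceeds the threshold."""
    bound = percentile_upper_bound(buckets, pct)
    return bound is not None and bound > threshold_ns


# Hypothetical log2 latency buckets in nanoseconds.
buckets = [(512, 1024, 3), (1024, 2048, 20), (2048, 4096, 7)]
print(percentile_upper_bound(buckets, 50), should_alert(buckets, 99, 3000))
```

Because log2 buckets double in width, the estimate can overshoot the true percentile by up to a factor of two, so thresholds are best set with that granularity in mind.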
Patterns in Production: Sampling, Filtering, and Aggregation
Sampling to reduce overhead
Instead of tracing every request, sample a configurable percentage:
// sample roughly 1% of write() syscalls
tracepoint:syscalls:sys_enter_write /rand % 100 == 0/ {
  @samples[pid] = count();
}
The rand builtin is evaluated inside the kernel together with the rest of the filter, so roughly 99 % of events are discarded before any data is recorded or crosses the user‑space boundary.
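Sampled counts have to be scaled back up before they are compared with unsampled metrics. The following sketch simulates a 1 % filter in plain Python and rescales the result; the function names and the event count are made up for the illustration, and Python's `random` module stands in for the kernel's random source.

```python
import random


def sample_events(total_events: int, rate_percent: int, seed: int = 0) -> int:
    """Count how many events pass a 'random value % 100 < rate' style filter."""
    rng = random.Random(seed)
    return sum(1 for _ in range(total_events) if rng.randrange(100) < rate_percent)


def estimate_total(sampled: int, rate_percent: int) -> float:
    """Scale a sampled count back up to an estimate of the true total."""
    return sampled * 100 / rate_percent


true_total = 100_000  # hypothetical number of write() calls
sampled = sample_events(true_total, rate_percent=1)
print(f"sampled={sampled} estimated_total={estimate_total(sampled, 1):.0f}")
```

The estimate gets noisier as the sampled count shrinks, so low-traffic processes may need a higher sampling rate than busy ones.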
Dynamic filtering with map‑driven rules
You can update filter criteria at runtime by writing to a BPF map from a controller pod:
// C snippet for a map-driven filter
BPF_HASH(filter_pids, u32, u8);

int filter_execve(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u8 *allowed = filter_pids.lookup(&pid);
    if (!allowed)
        return 0; // drop event
    // ... collect data ...
    return 0;
}
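A sidecar writes {pid: 1} into filter_pids via bpftool map update, instantly enabling tracing for that PID without reloading the program. bpftool needs a handle to the map (its ID or a pinned path), and because the filter only checks whether a key exists, tracing is switched off again by deleting the entry rather than by writing 0.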
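bpftool expects map keys and values as raw bytes, so a controller has to encode the PID itself. The sketch below builds the argument lists for `bpftool map update` and `bpftool map delete` without executing them. The pin path is hypothetical (BCC does not pin maps by default), and little-endian byte order is an assumption that holds for x86-64 and most arm64 hosts.

```python
import struct
from typing import List


def _hex_bytes(data: bytes) -> List[str]:
    """Render bytes the way bpftool's 'hex' keyword expects them."""
    return [f"{b:02x}" for b in data]


def enable_args(pin_path: str, pid: int) -> List[str]:
    """argv that inserts {pid: 1} into the pinned filter map."""
    key = struct.pack("<I", pid)  # u32 key, little-endian (assumed)
    value = struct.pack("<B", 1)  # u8 value
    return ["bpftool", "map", "update", "pinned", pin_path,
            "key", "hex", *_hex_bytes(key),
            "value", "hex", *_hex_bytes(value)]


def disable_args(pin_path: str, pid: int) -> List[str]:
    """argv that removes the PID, which turns tracing off for it."""
    key = struct.pack("<I", pid)
    return ["bpftool", "map", "delete", "pinned", pin_path,
            "key", "hex", *_hex_bytes(key)]


# Hypothetical pin location; pass the list to subprocess.run() on a real node.
print(" ".join(enable_args("/sys/fs/bpf/filter_pids", 12345)))
```

Building the argument list separately from running it keeps the encoding testable without root privileges.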
Aggregation with per‑CPU maps
Per‑CPU maps avoid lock contention when many CPUs generate events simultaneously:
BPF_PERCPU_ARRAY(cnt, u64, 1);

int count_sched(struct pt_regs *ctx) {
    int zero = 0;  // the array has a single slot at index 0
    u64 *val = cnt.lookup(&zero);
    if (val)
        (*val)++;  // each CPU updates its own copy, so no atomic is needed
    return 0;
}
When Prometheus scrapes, the exporter sums per‑CPU counters to produce a single metric.
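The summing step can be sketched in a few lines. A lookup on a per-CPU map returns one value per possible CPU, each padded to 8 bytes, so the exporter decodes the buffer and adds the slots together. The function names and the four-CPU sample buffer are illustrative, and little-endian layout is assumed.

```python
import struct
from typing import List


def decode_percpu_u64(raw: bytes) -> List[int]:
    """Decode a per-CPU lookup result into one u64 per CPU slot."""
    if len(raw) % 8 != 0:
        raise ValueError("per-CPU values are padded to 8-byte slots")
    n_cpus = len(raw) // 8
    return list(struct.unpack(f"<{n_cpus}Q", raw))


def total_count(raw: bytes) -> int:
    """Sum all per-CPU slots into the single value exposed as a metric."""
    return sum(decode_percpu_u64(raw))


# Hypothetical buffer for a 4-CPU host.
raw = struct.pack("<4Q", 10, 0, 7, 25)
print(decode_percpu_u64(raw), total_count(raw))
```

The sum is not an atomic snapshot, since CPUs keep counting while the slots are read, which is normally acceptable for monotonically increasing counters.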
Performance and Safety Considerations
| Concern | Mitigation |
|---|---|
| Map size exhaustion | Pre‑allocate generous map capacity, use LRU hash maps (BPF_MAP_TYPE_LRU_HASH) to evict stale entries. |
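| JIT compilation latency | Load and pin programs once during agent or pod init so the verification and JIT cost is paid at startup rather than on the hot path; each new load is verified and JIT‑compiled again. |
| Verification failures | Keep programs small (kernels before 5.2 cap a program at 4096 instructions; later kernels let privileged loaders reach 1 million verified instructions), bound every loop, and use helper functions (bpf_probe_read_user) for safe memory access. |
| Security sandbox | Grant loaders only the capabilities they need (CAP_BPF plus CAP_PERFMON or CAP_NET_ADMIN on Linux 5.8+, CAP_SYS_ADMIN on older kernels), keep unprivileged eBPF disabled (kernel.unprivileged_bpf_disabled=1), and restrict who can access pinned objects under /sys/fs/bpf. |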
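The LRU mitigation in the first row is easiest to see in a toy model. The class below imitates a bounded counter map that evicts the least recently used key when full. It is a user-space simulation built on `OrderedDict`, not the kernel's implementation, whose eviction is only approximately LRU.

```python
from collections import OrderedDict
from typing import Dict, Hashable


class LruCounter:
    """Bounded counter map that evicts the least recently used key."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._data: "OrderedDict[Hashable, int]" = OrderedDict()

    def increment(self, key: Hashable) -> None:
        if key in self._data:
            self._data.move_to_end(key)      # mark as most recently used
        elif len(self._data) >= self.max_entries:
            self._data.popitem(last=False)   # evict the stalest entry
        self._data[key] = self._data.get(key, 0) + 1

    def snapshot(self) -> Dict[Hashable, int]:
        return dict(self._data)


counter = LruCounter(max_entries=2)
for pid in (1, 2, 1, 3):  # hypothetical PIDs; inserting 3 evicts 2
    counter.increment(pid)
print(counter.snapshot())
```

The trade-off is that evicted keys lose their history, so LRU maps suit per-PID or per-connection state better than counters that must never reset.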
Measuring overhead with perf
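perf stat -e cycles:k -a -- sleep 10
Run the command before and after loading your eBPF probes, under the same workload. The :k modifier counts kernel‑mode cycles, which is where eBPF programs execute; a <2 % increase in cycles is typical for well‑scoped probes.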
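The before/after comparison boils down to a percentage difference between two cycle counts. The sketch below extracts the count from a `perf stat` summary line and computes the relative increase; the sample lines are invented, and the regular expression assumes perf's default human-readable output with comma thousands separators.

```python
import re

# Matches lines such as "   1,020,000,000      cycles:k" (format assumed).
_CYCLES = re.compile(r"^\s*([\d,]+)\s+cycles")


def parse_cycles(line: str) -> int:
    """Pull the cycle count out of one perf stat summary line."""
    m = _CYCLES.match(line)
    if not m:
        raise ValueError(f"no cycle count found in: {line!r}")
    return int(m.group(1).replace(",", ""))


def overhead_percent(baseline_cycles: int, probed_cycles: int) -> float:
    """Relative increase in cycles after loading the probes."""
    if baseline_cycles <= 0:
        raise ValueError("baseline must be positive")
    return (probed_cycles - baseline_cycles) / baseline_cycles * 100.0


# Hypothetical measurements taken without and with probes attached.
before = parse_cycles("     1,000,000,000      cycles:k")
after = parse_cycles("     1,020,000,000      cycles:k")
print(f"{overhead_percent(before, after):.1f}% more cycles")
```

A single pair of runs is noisy; averaging several measurements under a steady workload gives a more trustworthy figure.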
Debugging and Tooling
- bpftool prog dump xlated – Dump the translated (post‑verifier) bytecode; use bpftool prog dump jited to see the JIT‑compiled machine code.
- bpftrace -l – List all available probes on the host.
- cilium bpf map list – Inspect maps created by Cilium's eBPF programs.
- llvm-objdump -d – Disassemble compiled BPF object files for low‑level inspection.
When a program is rejected by the verifier, bpftool prog load prints a detailed log explaining the offending instruction. Example:
bpftool prog load prog.o /sys/fs/bpf/myprog type kprobe
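If you see an error such as invalid read from stack, it usually means the program read stack memory that it had not initialized first (for example, a struct whose padding was never zeroed). Exceeding the 512‑byte stack limit that the verifier enforces is reported as a separate error.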
Key Takeaways
- eBPF provides production‑grade tracing with microsecond‑scale overhead and no kernel recompilation.
- Combine BPFtrace for ad‑hoc syscall tracing, Cilium Hubble for XDP‑based networking observability, and BCC for custom aggregations.
- Use per‑CPU maps, LRU hash maps, and dynamic filter maps to keep performance predictable at scale.
- Safety is enforced by the kernel verifier; keep programs small, free of unbounded loops, and use helper APIs for memory access.
- Deploy agents as DaemonSets, expose metrics via Prometheus exporters, and visualize with Grafana for a complete end‑to‑end stack.