Table of Contents
- Introduction
- Observability Foundations for Distributed Systems
2.1. The Three Pillars: Metrics, Traces, Logs
2.2. Challenges in Modern Cloud‑Native Environments - OpenTelemetry: The Vendor‑Neutral Telemetry Framework
3.1. Core Concepts
3.2. Instrumentation Libraries & SDKs
3.3. Exporters & Collectors - eBPF: In‑Kernel, Low‑Overhead Instrumentation
4.1. What is eBPF?
4.2. Typical Use‑Cases for Observability - Why Combine OpenTelemetry and eBPF?
- Architecture Blueprint
6.1. Data Flow Diagram
6.2. Component Interaction - High‑Performance Profiling with eBPF
7.1. Capturing CPU, Memory, and I/O
7.2. Sample eBPF Programs (BCC & libbpf) - Instrumenting Applications with OpenTelemetry
8.1. Automatic vs Manual Instrumentation
8.2. Go Example: Tracing an HTTP Service
8.3. Python Example: Exporting Metrics to Prometheus - Bridging eBPF Data into OpenTelemetry Pipelines
9.1. Custom Exporter for eBPF Metrics
9.2. Using OpenTelemetry Collector with eBPF Receiver - Visualization & Alerting
10.1. Grafana Dashboards for eBPF‑derived Metrics
10.2. Jaeger/Tempo for Distributed Traces - Real‑World Case Study: Scaling a Microservice Platform
- Best Practices & Common Pitfalls
- Conclusion
- Resources
Introduction
Observability has become the cornerstone of modern distributed systems. As microservice architectures, serverless functions, and edge workloads proliferate, engineers need deep, low‑latency insight into what their code is doing across the entire stack—from the kernel up to the application layer. Traditional monitoring tools either incur prohibitive overhead or lack the granularity required to troubleshoot performance regressions in real time.
Enter OpenTelemetry, the community‑driven, vendor‑neutral standard for collecting traces, metrics, and logs. And eBPF (extended Berkeley Packet Filter), a powerful in‑kernel virtual machine that enables safe, high‑performance instrumentation without modifying application code.
This article explores how to master distributed systems observability by combining OpenTelemetry’s flexible telemetry pipeline with eBPF’s low‑overhead profiling capabilities. We’ll walk through the theory, show practical code snippets, and present a real‑world deployment scenario that demonstrates the value of this partnership.
Note: The concepts discussed assume familiarity with Linux, container orchestration (e.g., Kubernetes), and basic observability terminology. Newcomers may want to skim the “Observability Foundations” section before diving deeper.
Observability Foundations for Distributed Systems
The Three Pillars: Metrics, Traces, Logs
| Pillar | Definition | Typical Use‑Case |
|---|---|---|
| Metrics | Numeric time‑series data (counters, gauges, histograms) | Capacity planning, SLA compliance |
| Traces | End‑to‑end request journeys across services, composed of spans | Root‑cause analysis of latency spikes |
| Logs | Immutable, timestamped records of events | Debugging, audit trails |
A robust observability strategy blends all three. While logs provide raw detail, metrics give a quick health snapshot, and traces expose the causal flow of requests.
Challenges in Modern Cloud‑Native Environments
- Dynamic Topology – Pods and containers appear/disappear at scale, making static instrumentation brittle.
- High Cardinality – Service mesh sidecars, per‑request IDs, and user identifiers generate massive label spaces.
- Performance Overhead – Adding instrumentation can degrade latency, especially in latency‑sensitive microservices.
- Data Deluge – Storing and querying billions of data points requires efficient pipelines and storage back‑ends.
Addressing these challenges demands low‑overhead, automatic data collection that can survive rapid topology changes—precisely where eBPF shines.
OpenTelemetry: The Vendor‑Neutral Telemetry Framework
Core Concepts
- Instrumentation – Code (auto or manual) that produces telemetry data.
- SDK – Language‑specific implementation handling context propagation, sampling, and exporting.
- Collector – A process (or sidecar) that receives telemetry from SDKs, performs processing (e.g., batching, attribute enrichment), and forwards to back‑ends.
- Exporters – Pluggable components that ship data to observability platforms such as Prometheus, Jaeger, or Tempo.
Instrumentation Libraries & SDKs
OpenTelemetry provides auto‑instrumentation agents for Java, .NET, Node.js, and Python, among others. Manual instrumentation gives fine‑grained control and is often necessary for custom business logic.
// Go manual instrumentation example
import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/trace"
"net/http"
)
var tracer = otel.Tracer("example.com/quickstart")
func handler(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), "handler")
defer span.End()
// Simulate work
time.Sleep(50 * time.Millisecond)
w.Write([]byte("Hello, OpenTelemetry!"))
// Attach attributes if needed
span.SetAttributes(attribute.String("method", r.Method))
}
Exporters & Collectors
The OpenTelemetry Collector can be deployed as a daemonset in Kubernetes, a sidecar, or a standalone service. Exporters like otlp, prometheusreceiver, and jaegerexporter allow seamless integration with downstream systems.
eBPF: In‑Kernel, Low‑Overhead Instrumentation
What is eBPF?
eBPF is a virtual machine inside the Linux kernel that runs sandboxed bytecode. Programs are attached to tracepoints, kprobes, uprobes, sockets, or XDP (eXpress Data Path) hooks. Because they execute in kernel space, they can capture events with nanosecond precision and minimal context switching.
Key properties:
- Safety – Verified by the kernel verifier before loading.
- Performance – Overhead often <1% even at high event rates.
- Flexibility – Can read kernel structures, network packets, or user‑space symbols.
Typical Use‑Cases for Observability
- System‑wide profiling (CPU, memory, I/O).
- Network latency tracing (socket events, TCP retransmissions).
- Security monitoring (syscall auditing, file access).
- Application‑specific metrics via uprobes on binary functions.
Why Combine OpenTelemetry and eBPF?
| OpenTelemetry | eBPF |
|---|---|
| Unified data model (OTLP) | Zero‑overhead kernel tracing |
| Vendor‑agnostic exporters | Access to low‑level kernel events |
| Automatic context propagation | Ability to instrument binaries without source changes |
| Rich ecosystem (Grafana, Jaeger) | Real‑time profiling across containers/pods |
Synergy: eBPF collects high‑resolution system metrics and events that OpenTelemetry can normalize and export alongside application traces and logs. This creates a single observability pipeline capable of answering “Why is request X slow?” with data ranging from TCP retransmissions to database query spans.
Architecture Blueprint
Data Flow Diagram
+-------------------+ +-------------------+ +-------------------+
| Application | ---> | OpenTelemetry SDK| ---> | OpenTelemetry |
| (Go, Python, …) | | (auto/manual) | | Collector |
+-------------------+ +-------------------+ +-------------------+
| |
v v
+---------------------------+ +-------------------+
| eBPF Probe (BCC/libbpf) | | Exporters (OTLP, |
| - CPU, I/O, Syscalls | | Prometheus, Jaeger)|
+---------------------------+ +-------------------+
| |
v v
+-----------------------------------------------+
| Observability Backend (Grafana, Tempo, etc.)|
+-----------------------------------------------+
Component Interaction
- eBPF probes run on each node, gathering kernel‑level metrics.
- A user‑space daemon (e.g.,
ebpf_exporteror a custom collector) translates raw eBPF data into OTLP metric format. - The OpenTelemetry Collector receives both application telemetry (via SDKs) and eBPF‑derived metrics, applies processing (aggregation, filtering), and forwards to back‑ends.
- Visualization tools ingest the unified data, allowing engineers to correlate system‑level events with application traces.
High‑Performance Profiling with eBPF
Capturing CPU, Memory, and I/O
eBPF can attach to kernel tracepoints such as sched:sched_switch for context switches, mm:page_fault for page faults, and block:block_rq_issue for block I/O.
Example using BCC (Python front‑end):
# cpu_profile.py
from bcc import BPF
import time
# BPF program: count CPU time per process
prog = """
int on_sched_switch(struct pt_regs *ctx, struct task_struct *prev) {
u64 pid = prev->pid;
u64 delta = bpf_ktime_get_ns() - prev->start_time;
bpf_trace_printk("PID %d spent %llu ns\\n", pid, delta);
return 0;
}
"""
b = BPF(text=prog)
b.attach_tracepoint(tp="sched:sched_switch", fn_name="on_sched_switch")
print("Tracing... Hit Ctrl-C to end.")
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
pass
The program prints per‑process CPU usage with nanosecond precision, incurring negligible overhead.
Sample eBPF Programs (BCC & libbpf)
BCC (Python) – Network Latency
# tcp_rtt.py
from bcc import BPF
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/tcp.h>
BPF_HASH(start, u64, u64);
BPF_HISTOGRAM(rtt_hist);
int trace_tcp_rcv_established(struct pt_regs *ctx, struct sock *sk) {
u64 ts = bpf_ktime_get_ns();
u64 pid = bpf_get_current_pid_tgid();
start.update(&pid, &ts);
return 0;
}
int trace_tcp_rcv_established_ret(struct pt_regs *ctx) {
u64 *tsp, delta;
u64 pid = bpf_get_current_pid_tgid();
tsp = start.lookup(&pid);
if (tsp == 0) { return 0; }
delta = bpf_ktime_get_ns() - *tsp;
rtt_hist.increment(bpf_log2l(delta / 1000)); // microseconds
start.delete(&pid);
return 0;
}
"""
b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_rcv_established", fn_name="trace_tcp_rcv_established")
b.attach_kretprobe(event="tcp_rcv_established", fn_name="trace_tcp_rcv_established_ret")
libbpf (C) – CPU Flamegraph
// cpu_flame.c
#include <bpf/bpf_helpers.h>
struct {
__uint(type, BPF_MAP_TYPE_STACK_TRACE);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u64));
__uint(max_entries, 10240);
} stack_traces SEC(".maps");
SEC("tracepoint/sched/sched_switch")
int on_switch(struct trace_event_raw_sched_switch *ctx) {
u64 pid = bpf_get_current_pid_tgid();
u32 stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);
if (stack_id >= 0) {
bpf_map_update_elem(&stack_traces, &pid, &stack_id, BPF_ANY);
}
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Compile with clang -O2 -target bpf -c cpu_flame.c -o cpu_flame.o and load via bpftool or a Go libbpf wrapper. The resulting stack traces can be fed into a flamegraph visualizer.
Instrumenting Applications with OpenTelemetry
Automatic vs Manual Instrumentation
| Approach | Pros | Cons |
|---|---|---|
| Automatic (agents) | Zero code changes, quick rollout | Limited to supported frameworks, may miss business‑logic spans |
| Manual | Full control, custom attributes | Requires developer effort, risk of missing context propagation |
A hybrid approach is common: auto‑instrument HTTP servers and databases, then add manual spans around critical business functions.
Go Example: Tracing an HTTP Service
package main
import (
"log"
"net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
"go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/trace"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
func initTracer() func() {
ctx := context.Background()
exporter, err := otlptracehttp.New(ctx)
if err != nil {
log.Fatalf("failed to create exporter: %v", err)
}
tp := trace.NewTracerProvider(trace.WithBatcher(exporter))
otel.SetTracerProvider(tp)
return func() { _ = tp.Shutdown(ctx) }
}
func main() {
shutdown := initTracer()
defer shutdown()
mux := http.NewServeMux()
mux.HandleFunc("/order", func(w http.ResponseWriter, r *http.Request) {
ctx, span := otel.Tracer("order-service").Start(r.Context(), "processOrder")
defer span.End()
// Simulate DB call
time.Sleep(30 * time.Millisecond)
// Add custom attributes
span.SetAttributes(attribute.String("order.id", "12345"))
w.Write([]byte("order placed"))
_ = ctx // use ctx for downstream calls
})
// Wrap with otelhttp middleware for automatic trace propagation
wrapped := otelhttp.NewHandler(mux, "order-server")
log.Println("Listening on :8080")
log.Fatal(http.ListenAndServe(":8080", wrapped))
}
Python Example: Exporting Metrics to Prometheus
# app.py
from prometheus_client import start_http_server, Counter
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
# Set up OpenTelemetry metrics pipeline
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)
# Define a counter metric
request_counter = meter.create_counter(
name="http_requests_total",
description="Total HTTP requests",
unit="1",
)
def handle_request():
request_counter.add(1, {"method": "GET", "endpoint": "/health"})
# Business logic here
if __name__ == "__main__":
# Expose Prometheus endpoint on :9464
start_http_server(9464)
while True:
handle_request()
time.sleep(1)
Running this script exposes metrics at http://localhost:9464/metrics, which can be scraped by Prometheus and visualized in Grafana alongside eBPF‑derived data.
Bridging eBPF Data into OpenTelemetry Pipelines
Custom Exporter for eBPF Metrics
A lightweight daemon can read BPF maps (e.g., via bpf_map_lookup_elem) and push the data as OTLP metric records.
// ebpf_exporter.go (simplified)
package main
import (
"context"
"log"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
"go.opentelemetry.io/otel/sdk/metric"
"github.com/iovisor/gobpf/bcc"
)
func main() {
// Initialize OTLP metric exporter
exp, _ := otlpmetrichttp.New(context.Background())
provider := metric.NewMeterProvider(metric.WithReader(metric.NewPeriodicReader(exp)))
otel.SetMeterProvider(provider)
meter := otel.Meter("ebpf-exporter")
cpuUsage := metric.Must(meter).NewFloat64Gauge("node_cpu_usage_seconds_total")
// Load BPF program (compiled elsewhere) and get map reference
module := bcc.NewModule(bpfSource, []string{})
cpuMap := module.Table("cpu_usage")
for {
// Iterate map entries
for _, kv := range cpuMap.Iter() {
pid := binary.LittleEndian.Uint32(kv.Key)
ns := binary.LittleEndian.Uint64(kv.Leaf)
cpuUsage.Record(context.Background(), float64(ns)/1e9, attribute.Int("pid", int(pid)))
}
time.Sleep(5 * time.Second)
}
}
The exporter runs as a DaemonSet in Kubernetes, ensuring each node forwards its eBPF metrics to the collector.
Using OpenTelemetry Collector with eBPF Receiver
The OpenTelemetry Collector Contrib now offers an ebpfreceiver (experimental) that directly reads eBPF maps and converts them to OTLP. Configuration example:
receivers:
otlp:
protocols:
grpc:
ebpf:
collection_interval: 10s
metrics:
- name: cpu.time
map_name: cpu_usage
type: histogram
processors:
batch:
exporters:
prometheus:
endpoint: "0.0.0.0:9090"
otlp:
endpoint: "tempo:4317"
tls:
insecure: true
service:
pipelines:
metrics:
receivers: [otlp, ebpf]
processors: [batch]
exporters: [prometheus, otlp]
Deploying this collector alongside your workloads gives you a single source of truth for both application‑level and kernel‑level metrics.
Visualization & Alerting
Grafana Dashboards for eBPF‑derived Metrics
Grafana’s Explore and Dashboard features can ingest Prometheus metrics produced by the collector. A typical dashboard includes:
- CPU Usage by PID (heatmap) – correlates spikes with trace IDs.
- Disk I/O latency – stacked bar per pod.
- Network retransmissions – line chart with alerts on thresholds.
Grafana also supports templating, allowing you to drill down from a high‑level service view to the exact kernel events that caused the slowdown.
Jaeger/Tempo for Distributed Traces
Jaeger UI (or Grafana Tempo) visualizes spans collected via OTLP. By adding span attributes that reference eBPF‑collected IDs (e.g., pid, cgroup), you can link a trace to its underlying system metrics.
Tip: Use the
process.runtime.idattribute (available in most SDKs) together with thepidmetric to create a cross‑reference table in Grafana.
Real‑World Case Study: Scaling a Microservice Platform
Background:
A fintech company runs a Kubernetes‑based microservice platform handling 50 k requests/sec during peak trading hours. They observed occasional latency spikes (>2 s) that standard APM tools could not explain.
Approach:
Deploy eBPF probes on all worker nodes to capture:
sched:sched_switchfor CPU context switches.block:block_rq_issue/block:block_rq_completefor disk I/O.tcp:tcp_probefor TCP retransmissions.
Add OpenTelemetry auto‑instrumentation to Java Spring Boot services and Go API gateways.
Run a custom OpenTelemetry Collector with the
ebpfreceiverandotlpreceiver. Export metrics to Prometheus and traces to Tempo.Build Grafana dashboards that overlay per‑service latency (from traces) with node‑level CPU and I/O histograms.
Findings:
- Spikes correlated with high kernel‑level CPU steal time on nodes co‑located with a noisy batch‑processing job.
- A particular service (
order-service) exhibited excessive TCP retransmissions due to a misconfigured MTU on its outbound NIC. - eBPF‑derived disk latency revealed a storage bottleneck on a subset of nodes running stateful workloads.
Outcome:
By adjusting the batch job’s CPU limits and correcting the MTU, latency variance dropped from 2 s to <200 ms. The unified observability pipeline also reduced mean time to detection (MTTD) from 30 min to <5 min.
Best Practices & Common Pitfalls
| Practice | Why It Matters |
|---|---|
| Run eBPF probes as non‑root where possible | Modern kernels allow unprivileged BPF loading (CAP_BPF). Reduces attack surface. |
| Sample rather than trace everything | High‑frequency events (e.g., every sched_switch) can overwhelm collectors; use probabilistic sampling. |
| Tag eBPF metrics with Kubernetes metadata (pod, namespace, node) | Enables correlation with service‑level data. Use cgroup IDs or the k8s field in the collector’s attributes processor. |
| Version‑lock OpenTelemetry components | Breaking changes in SDKs or collectors can cause data loss. Pin to known releases. |
| Validate end‑to‑end latency | Instrument both client and server sides; ensure context propagation (W3C Trace‑Context) works across language boundaries. |
| Monitor collector resource usage | Collector can become a bottleneck; scale horizontally or use sidecar mode per namespace. |
Common Pitfalls
- Over‑instrumentation – Adding spans to every function leads to massive trace volumes and high overhead. Use selective instrumentation.
- Missing kernel symbols – When uprobes target stripped binaries, eBPF may fail. Deploy debug symbols or use
perf‑compatible symbols. - Cardinality explosion – Adding too many labels (e.g., per‑user IDs) can blow up Prometheus storage. Aggregate or hash high‑cardinality attributes.
- Incorrect time synchronization – eBPF timestamps are kernel monotonic; OTLP expects UTC. Convert using
clock_gettime(CLOCK_REALTIME)when merging.
Conclusion
Mastering observability in distributed systems requires visibility at every layer—from the high‑level request flow down to the low‑level kernel events that shape performance. By marrying OpenTelemetry’s flexible, vendor‑agnostic telemetry model with eBPF’s ultra‑low‑overhead, in‑kernel profiling, you gain a unified pipeline that:
- Captures rich, correlated data (traces, metrics, logs, and kernel events).
- Operates at nanosecond precision without sacrificing application latency.
- Scales horizontally across Kubernetes clusters using standard collector deployments.
- Enables real‑time troubleshooting and proactive alerting through powerful visualizations.
Implementing this stack may involve an upfront investment in eBPF program development and collector configuration, but the payoff—faster MTTR, reduced infrastructure waste, and a deeper understanding of system behavior—far outweighs the cost. As cloud‑native workloads continue to grow in complexity, the combination of OpenTelemetry and eBPF will become a cornerstone of high‑performance, production‑grade observability.
Resources
OpenTelemetry Documentation – Comprehensive guides, SDK references, and collector configuration.
https://opentelemetry.io/docs/eBPF.io – The eBCC & libbpf Projects – Tutorials, examples, and API docs for writing eBPF programs.
https://ebpf.io/Grafana Labs – Observability with eBPF – Blog post and dashboard templates for integrating eBPF metrics.
https://grafana.com/blog/2023/06/14/observability-with-ebpf/Jaeger – Open-source Distributed Tracing – Deployment guides and UI walkthroughs.
https://www.jaegertracing.io/Prometheus – Monitoring System & Time Series Database – Official documentation for scraping and alerting.
https://prometheus.io/docs/