TL;DR — Flame graphs are a powerful visual tool, but they inherit sampling bias and can hide short‑lived hot paths. By combining deterministic tracing, multi‑process aggregation, and careful bucket sizing, you can surface hidden execution costs in systems like Kafka, Go micro‑services, and Java workloads.
Performance engineers often reach for flame graphs when a latency spike defies explanation. The stacked-box layout shows at a glance which functions dominate CPU time, yet many teams treat the output as a silver bullet. In reality, flame graphs inherit the characteristics of the underlying sampler, can mask rare but expensive calls, and can mislead when the sampling interval is too coarse. This post unpacks those blind spots, demonstrates how to mitigate sampling bias, and shows concrete patterns for wiring flame graphs into a production observability stack.
Understanding Flame Graphs
What a Flame Graph Is (and Isn’t)
A flame graph is a stack‑sample histogram rendered as a series of horizontally stacked boxes:
- Width = proportion of samples that included the function.
- Height = call‑stack depth (the “flame” grows upward).
- Color = optional categorization (e.g., user vs. kernel, or language runtime).
It does not measure wall‑clock time directly; it reflects how often a particular stack frame appeared in the sample set. The underlying sampler can be:
| Sampler | Typical Use‑Case | Resolution |
|---|---|---|
| perf record (Linux) | Native binaries, kernel | 1 ms to 10 ms |
| async-profiler (Java) | JVM bytecode | 1 ms |
| py-spy (Python) | Interpreter frames | 5 ms |
| eBPF‑based bpftrace | System‑wide events | sub‑ms (depending on config) |
Because the graph is a statistical representation, any systematic bias in the sampler propagates into the visual.
The Sampling Equation
The probability of observing a function f in a sample set S is:
P(f) = (time spent in f) / (total observed time) * (1 - e^{-λ·Δ})
where:
- λ = sampling rate (samples per second)
- Δ = average duration of f (in seconds)
When Δ << 1/λ, the exponential e^{-λ·Δ} approaches 1, so the factor (1 - e^{-λ·Δ}) ≈ λ·Δ shrinks toward zero, meaning short‑lived functions are under‑sampled. This is the core of the blind‑spot problem.
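To get a feel for the numbers, the last factor can be evaluated directly. The sketch below takes the model in the equation above at face value; the helper name hit_probability is hypothetical, and it only needs awk.

```shell
# hit_probability RATE_HZ DURATION_SECONDS
# Prints 1 - e^(-lambda * delta): the chance that a single invocation lasting
# delta seconds is caught by at least one sample at rate lambda.
hit_probability() {
  LC_ALL=C awk -v rate="$1" -v dur="$2" \
    'BEGIN { printf "%.4f\n", 1 - exp(-rate * dur) }'
}

hit_probability 99 0.0002    # one 200-microsecond call at 99 Hz  -> 0.0196
hit_probability 99 0.05      # one 50-millisecond call at 99 Hz   -> 0.9929
hit_probability 999 0.0002   # the same short call at 999 Hz      -> 0.1811
```

Under this model a single 200 µs call has roughly a 2 % chance of being sampled at 99 Hz, and a tenfold higher rate still leaves it below 20 %.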
Sampling Mechanics and Bias
Fixed‑Interval vs. Randomized Sampling
| Method | Pros | Cons |
|---|---|---|
| Fixed‑interval (e.g., perf record -F 99) | Simple to configure, predictable load | Can synchronize with periodic workloads, causing systematic under‑ or over‑representation |
| Randomized (interval jitter, where the profiler supports it) | Breaks correlation with periodic code | Slightly higher CPU overhead, harder to reproduce exact runs |
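Practical tip: For workloads that contain periodic timers (e.g., Kafka’s fetcher loop), enable interval jitter if your profiler offers it. With a fixed‑rate sampler such as perf, pick a frequency that does not divide evenly into common timer periods (99 Hz rather than 100 Hz) to avoid aliasing.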
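The aliasing effect is easy to reproduce on paper. The following sketch simulates a made‑up workload that spends the first 1 ms of every 10 ms period in a hot function, then counts how many fixed‑interval samples land inside it. The helper alias_share and all timing constants are illustrative assumptions, not part of any profiler.

```shell
# alias_share INTERVAL_UNITS
# Time is counted in units of 1/99000 s so both sampling intervals are whole
# numbers: 990 units = 10 ms (100 Hz) and 1000 units = about 10.1 ms (99 Hz).
# The simulated workload is "hot" for the first 99 units (1 ms) of every
# 990-unit period (10 ms), so its true CPU share is 10%.
alias_share() {
  LC_ALL=C awk -v step="$1" 'BEGIN {
    period = 990; busy = 99; samples = 9900
    for (k = 0; k < samples; k++)
      if ((k * step) % period < busy) hits++
    printf "%.1f\n", 100 * hits / samples
  }'
}

alias_share 990    # sampling at exactly 100 Hz -> 100.0
alias_share 1000   # sampling at 99 Hz          -> 10.1
```

At exactly 100 Hz every sample lands in the hot window, so the function appears to own the CPU (with a different phase offset it would vanish entirely). At 99 Hz the samples drift through the period and the estimate settles near the true 10 % share.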
Over‑Sampling vs. Under‑Sampling
- Over‑sampling (high frequency) reduces statistical error but adds measurable overhead—often 2‑5 % of CPU on a busy service.
- Under‑sampling keeps overhead negligible (<0.2 %) but widens confidence intervals, especially for tail latencies.
A rule of thumb for production services:
# Aim for ~100 samples per second per CPU core
sudo perf record -F 100 -a -g -- sleep 30 # 100 Hz on a 4‑core box → up to ~400 samples/s when all cores are busy
If the observed overhead exceeds 3 % (measure with perf stat), back off to 30–50 Hz.
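To judge whether a given frequency is enough, it helps to estimate how many samples a function of interest will actually receive. The sketch below is a rough back‑of‑the‑envelope helper (sample_budget is a hypothetical name) that assumes every core is busy for the whole capture and that samples are independent, so the binomial standard error applies.

```shell
# sample_budget FREQ_HZ CORES SECONDS FRACTION
# Prints three numbers: total samples, expected samples inside a function that
# owns FRACTION of CPU time, and the relative standard error (in percent) of
# its measured share, using SE = sqrt(p * (1 - p) / n).
sample_budget() {
  LC_ALL=C awk -v f="$1" -v c="$2" -v s="$3" -v p="$4" 'BEGIN {
    n = f * c * s
    rel = 100 * sqrt(p * (1 - p) / n) / p
    printf "%.0f %.0f %.1f\n", n, n * p, rel
  }'
}

sample_budget 100 4 30 0.01   # -> 12000 120 9.1
sample_budget 30 4 30 0.01    # -> 3600 36 16.6
```

For a function that owns 1 % of CPU time, a 30‑second capture at 100 Hz on four cores pins its share down to within about 9 %, while backing off to 30 Hz nearly doubles that uncertainty.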
Hidden Execution Costs: The “Micro‑Spikes”
Short, high‑cost operations (e.g., a 200 µs lock acquisition) may appear in <1 % of samples, translating to a barely visible bar. Yet they can dominate the 99th‑percentile latency. To surface them:
- Reduce the sampling interval (increase frequency) until the bar becomes visible.
- Combine with latency histograms (e.g., Prometheus histogram_quantile) to correlate spikes.
- Enable event‑based tracing for specific symbols (bpftrace -e 'uprobe:/usr/lib/...:my_func { @[ustack] = count(); }').
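Before raising the sampling rate, it can be useful to check how small the suspect frame's share really is. The sketch below works on folded stacks (one "frame;frame;frame count" line per unique stack, the format the stackcollapse-* scripts emit). The helper name frame_share and the sample data are made up for illustration.

```shell
# frame_share FRAME
# Reads folded stacks on stdin and prints the percentage of samples whose
# stack contains FRAME as an exact frame name.
frame_share() {
  LC_ALL=C awk -v frame="$1" '{
    count = $NF; total += count
    stack = $0; sub(/ [0-9]+$/, "", stack)    # drop the trailing sample count
    n = split(stack, frames, ";")
    for (i = 1; i <= n; i++)
      if (frames[i] == frame) { hits += count; break }
  }
  END { if (total) printf "%.2f\n", 100 * hits / total }'
}

# Illustrative data: 10 of 1000 samples end in a lock acquisition.
frame_share pthread_mutex_lock <<'EOF'
broker;handle_request;parse 480
broker;handle_request;send_response 510
broker;handle_request;update_index;pthread_mutex_lock 10
EOF
```

Here the lock shows up in 1.00 % of samples, which renders as a sliver in the SVG even though every affected request pays for it.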
Blind Spots in Production
Multi‑Process and Containerized Environments
When a service runs as multiple processes, or as one process with several distinct thread pools (e.g., a Kafka broker with separate I/O and request handler threads), a single perf record -a captures everything, but the resulting flame graph merges stacks from every process that shares a command name. This can hide per‑process hot paths.
Pattern: Record per‑PID, then collapse with stackcollapse-perf.pl --pid so that each process keeps its own tower in the merged graph:
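# Collect per‑process data (one perf data file per PID).
# -f matches the full command line, because a Kafka broker's process name is "java".
for pid in $(pgrep -f -d' ' kafka); do
  sudo perf record -F 99 -p "$pid" -g -o "perf_$pid.data" -- sleep 30 &
done
wait
# Merge: perf script reads one input file at a time, so loop over the files
for f in perf_*.data; do
  sudo perf script -i "$f"
done | stackcollapse-perf.pl --pid | flamegraph.pl > kafka.flamegraph.svg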
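If the per‑process captures are collapsed separately (for example on different hosts), the folded files can still be combined afterwards by summing the counts of identical stacks. The sketch below shows one way to do that with awk; merge_folded is a hypothetical helper, and the "java-PID" root frames in the usage note only mimic what a PID‑tagged collapse might look like.

```shell
# merge_folded FILE...
# Sums the sample counts of identical stacks across several folded files and
# prints the merged result sorted by stack, ready to pipe into flamegraph.pl.
merge_folded() {
  LC_ALL=C awk '{
    count = $NF
    sub(/ [0-9]+$/, "")          # $0 is now the stack without its count
    merged[$0] += count
  }
  END { for (stack in merged) print stack, merged[stack] }' "$@" | LC_ALL=C sort
}
```

Called as merge_folded a.folded b.folded, where a.folded contains "java-101;run;poll 5" and b.folded contains "java-101;run;poll 3", the output has a single "java-101;run;poll 8" line. Stacks rooted at a different PID stay separate, which is what keeps per‑process hot paths visible.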
JIT‑Compiled Languages
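JIT runtimes such as the HotSpot JVM generate machine code dynamically. If the sampler cannot resolve symbols for the generated code, the flame graph shows unresolved frames (for example [unknown]), obscuring the true hot path. Go is compiled ahead of time rather than JIT‑compiled, but its runtime ships its own profiler and tracer, which are usually the easier route to readable stacks.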
Mitigation: Use the runtime’s built‑in profiling support:
- Java: -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints plus async-profiler.
- Go: go tool pprof -http=:8080 or go tool trace for deterministic traces.
Kernel‑User Boundary Blindness
A common mistake is to attribute high CPU to a user‑space function when the real cost is a kernel call (e.g., epoll_wait). Since many samplers aggregate kernel frames under a generic “kernel” bucket, the flame graph may give a false sense of security.
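Solution: Make sure kernel stacks are captured and color‑coded. perf record -g records both kernel and user call chains when it runs with sufficient privileges (root, or a permissive kernel.perf_event_paranoid setting). stackcollapse-perf.pl --kernel then tags kernel frames so that flamegraph.pl can color them separately (the java palette, despite its name, gives tagged kernel frames their own hue):
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl --kernel | flamegraph.pl --colors=java > full.flamegraph.svg
Now kernel frames appear in their own color, making the boundary visible.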
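A number is often easier to act on than a color. Assuming the folded stacks carry a _[k] suffix on kernel frames (the annotation stackcollapse-perf.pl adds with its --kernel option), the sketch below reports what fraction of samples were executing in the kernel. kernel_share is a hypothetical helper and the sample data is invented.

```shell
# kernel_share
# Reads folded stacks on stdin and prints the percentage of samples whose
# leaf (top-of-stack) frame carries the _[k] kernel annotation.
kernel_share() {
  LC_ALL=C awk '{
    count = $NF; total += count
    sub(/ [0-9]+$/, "")                 # strip the trailing sample count
    if ($0 ~ /_\[k\]$/) kernel += count
  }
  END { if (total) printf "%.1f\n", 100 * kernel / total }'
}

# Illustrative data: 620 of 1000 samples end inside the kernel receive path.
kernel_share <<'EOF'
java;Processor.run;read;entry_SYSCALL_64_[k];sock_recvmsg_[k] 620
java;Processor.run;decode 380
EOF
```

For this made‑up capture the helper prints 62.0, a quick signal that the cost sits below the system‑call boundary rather than in application code.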
Architecture: Integrating Flame Graphs with an Observability Stack
High‑Level Diagram
+-------------------+     +-------------------+     +--------------------+
|    Application    | --> |  Sampling Agent   | --> |   Data Collector   |
|  (Kafka, Go svc)  |     | (perf/async-prof) |     | (Prometheus, Loki) |
+-------------------+     +-------------------+     +--------------------+
                                    |                          |
                                    v                          v
                          +-------------------+     +--------------------+
                          |    Aggregator     | --> |    Storage (S3)    |
                          |  (Flamegraph.pl)  |     +--------------------+
                          +-------------------+
                                    |
                                    v
                          +---------------------+
                          | Dashboard (Grafana) |
                          +---------------------+
Component Breakdown
| Component | Responsibility | Production‑Ready Options |
|---|---|---|
| Sampling Agent | Collects stack samples at configurable rates; runs as a sidecar or host‑level daemon. | perf (Linux), async-profiler (JVM), py-spy (Python), eBPF agents (e.g., bpftrace). |
| Aggregator | Collapses raw stacks, merges per‑process data, generates SVG flame graphs. | Custom CI job using Brendan Gregg’s stackcollapse-* scripts; can be containerized by building an image from the FlameGraph repository. |
| Data Collector | Stores raw perf data and generated SVGs for later analysis; forwards metrics to Prometheus. | Loki for logs, MinIO/S3 for binary artifacts, Prometheus remote‑write for sample‑rate metrics. |
| Dashboard | Displays flame graphs alongside latency histograms, CPU usage, and alerts. | Grafana panels using the “Image” plugin; link to the SVG stored in S3 and open it directly for flamegraph.pl’s built‑in hover details and click‑to‑zoom (use --title to label each graph). |
Deployment Example (Kubernetes)
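apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: perf-sampler
spec:
  selector:
    matchLabels:
      name: perf-sampler
  template:
    metadata:
      labels:
        name: perf-sampler
    spec:
      hostPID: true
      containers:
        - name: sampler
          # Image name as given in this post; substitute an image that ships perf.
          image: ghcr.io/brendangregg/perf-agent:latest
          securityContext:
            privileged: true
          env:
            # Kubernetes does not run shell substitutions such as $(hostname) in args,
            # so expose the node name through the downward API instead.
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          args:
            - "-F"
            - "99"
            - "-g"
            - "--output=/data/perf_$(NODE_NAME).data"
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          hostPath:
            path: /var/perf-data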
The DaemonSet runs on every node and writes raw perf files to a host path on that node (/var/perf-data). A nightly CronJob then aggregates them into flame graphs stored in an S3 bucket that Grafana reads from.
Patterns in Production: When Flame Graphs Reveal Hidden Costs
1. The “Lock‑Contention Mirage”
Symptom: Latency spikes with no obvious CPU hot spot.
Investigation: Capture kernel stacks as well (perf record -g with root privileges) and increase sampling to 200 Hz.
Result: Flame graph shows a thin, deep stack ending in pthread_mutex_lock. The bar is narrow because each lock hold lasts ~30 µs, below the default sampling window.
Pattern: When latency histograms show tail growth but CPU graphs look flat, look for deep, narrow stacks that indicate lock contention or short syscalls.
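Deep, narrow stacks are hard to spot by eye in a large SVG, so it can help to list them from the folded data first. The sketch below is one possible filter; deep_narrow, its thresholds, and the sample data are illustrative assumptions.

```shell
# deep_narrow MIN_DEPTH MAX_PERCENT
# Reads folded stacks on stdin and prints "percent depth leaf" for every stack
# that is at least MIN_DEPTH frames deep and holds at most MAX_PERCENT of all
# samples.
deep_narrow() {
  LC_ALL=C awk -v min_depth="$1" -v max_pct="$2" '{
    count[NR] = $NF; total += $NF
    sub(/ [0-9]+$/, "")                       # strip the trailing sample count
    depth[NR] = split($0, frames, ";")
    leaf[NR] = frames[depth[NR]]
  }
  END {
    for (i = 1; i <= NR; i++) {
      pct = 100 * count[i] / total
      if (depth[i] >= min_depth && pct <= max_pct)
        printf "%.2f %d %s\n", pct, depth[i], leaf[i]
    }
  }'
}

# Illustrative data: one shallow, wide stack and one deep, narrow one.
deep_narrow 5 2 <<'EOF'
broker;poll;handle;respond 985
broker;poll;handle;append;flush;index_update;pthread_mutex_lock 15
EOF
```

For this input the filter prints "1.50 7 pthread_mutex_lock", pointing straight at the stack worth zooming into.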
2. The “Garbage‑Collector Surprise”
Symptom: Go service experiences occasional GC pauses.
Investigation: Run go tool pprof -alloc_space alongside perf sampling.
Result: Flame graph reveals a wide runtime.mallocgc bar. Rebuilding with -gcflags=all=-m prints the compiler’s escape‑analysis decisions, which helps confirm which values escape to the heap and drive the allocation pressure.
Pattern: Pair deterministic GC traces with flame graphs to separate allocation hot spots from CPU‑bound work.
3. The “Network‑Stack Bottleneck”
Symptom: Kafka broker latency rises after a config change.
Investigation: Sample system‑wide with kernel stacks included (sudo perf record -F 99 -a -g).
Result: The upper layers of the flame graph, where kernel frames sit on top of the user‑space callers, show a massive sock_recvmsg bar, indicating the broker is spending more time in the kernel’s receive path than in user‑space request handling.
Pattern: Always include kernel stacks when profiling I/O‑heavy services; otherwise you’ll misattribute network‑related cost to application code.
4. The “Third‑Party Library Blind Spot”
Symptom: A Java micro‑service’s 99th‑percentile latency climbs after adding a new library.
Investigation: Use async-profiler with -e cpu and an include filter (-I with a pattern for the library’s package) to isolate stacks that pass through the library.
Result: Flame graph highlights a narrow com.fasterxml.jackson.databind.ObjectMapper._writeValue function that appears in <0.5 % of samples but consumes ~150 µs per call.
Pattern: When a new dependency is added, run a targeted flame graph focusing on that package to catch hidden per‑call overhead.
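A simple way to apply this pattern is to compare the package's sample share in captures taken before and after the dependency was added. The sketch below assumes two folded‑stack files and uses plain substring matching on a package prefix; the helper names and the example file names are hypothetical.

```shell
# share_of PATTERN FILE
# Prints the percentage of samples whose stack contains PATTERN as a plain
# substring, so a package prefix matches every frame underneath it.
share_of() {
  LC_ALL=C awk -v pat="$1" '{
    total += $NF
    if (index($0, pat)) hits += $NF
  }
  END { printf "%.2f\n", total ? 100 * hits / total : 0 }' "$2"
}

# share_delta PATTERN BEFORE_FILE AFTER_FILE
# Prints "before after delta" (all in percent of samples). The body runs in a
# subshell so the temporary variables do not leak into the caller.
share_delta() (
  before=$(share_of "$1" "$2")
  after=$(share_of "$1" "$3")
  LC_ALL=C awk -v b="$before" -v a="$after" \
    'BEGIN { printf "%s %s %+.2f\n", b, a, a - b }'
)
```

For example, share_delta com/fasterxml/jackson before.folded after.folded prints something like "0.00 0.40 +0.40" when the library accounts for 4 of 1000 samples in the second capture and none in the first. A small positive delta that lines up with the tail‑latency regression is a strong hint that per‑call overhead, not bulk CPU time, is the culprit.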
Key Takeaways
- Flame graphs are statistical; sampling frequency and jitter directly affect visibility of short‑lived hot paths.
- Always capture kernel stacks (perf record -g with sufficient privileges) for I/O‑bound services to avoid misattributing cost.
- In multi‑process or containerized setups, merge per‑PID data rather than relying on a single system‑wide capture.
- Pair flame graphs with deterministic traces (e.g., Go pprof, Java Flight Recorder) to differentiate allocation pressure from CPU work.
- Integrate flame‑graph generation into your CI/CD pipeline: collect raw samples nightly, aggregate to SVG, and expose via Grafana for rapid root‑cause analysis.
Further Reading
- Brendan Gregg’s Flame Graphs – the canonical reference and collection of tooling scripts.
- Linux perf Documentation – detailed guide on sampling options, call‑graph capture, and performance counters.
- Async-profiler GitHub Repository – modern, low‑overhead profiler for Java and native code.
- Grafana Image Panel Plugin – embed generated flame‑graph SVGs in dashboards.
- bpftrace – Advanced Tracing with eBPF – write custom sampling scripts for kernel‑space and user‑space events.
