Continuous Profiling for Go: eBPF, Pyroscope, and OpenTelemetry

For years, the industry has rallied around the 'three pillars of observability': metrics, logs, and traces. While these signals are excellent for identifying that a system is failing or where a delay occurs in a distributed transaction, they often fail to answer the most fundamental resource question: Why is my CPU at 90%?

In Go microservices, we've traditionally relied on pprof for ad-hoc profiling. However, capturing a profile after an incident has already occurred is reactive. By the time you log in to run go tool pprof, the spike is often gone. This is where continuous profiling comes in. By pairing OpenTelemetry (OTel) with an eBPF-based agent such as the one Pyroscope provides, we can keep a continuous, sampled record of where CPU time goes, function by function, across the entire fleet with low overhead.

The Shift to Continuous Profiling

Continuous profiling is the process of constantly taking small samples of an application's resource usage (CPU, memory, goroutines) and aggregating them over time. Unlike standard metrics, which reduce resource usage to a single number (e.g., CPU utilization), profiles attribute that usage to the stack traces responsible for it.

In a Go context, this means instead of seeing a dashboard that says 'Service A is using 4 cores,' you see a flame graph showing that json.Unmarshal is consuming 2.5 of those cores, along with the call path in your own code that leads there, such as an inefficiently decoded struct.

Why OpenTelemetry Profiling?

OpenTelemetry now defines profiling as a signal alongside metrics, logs, and traces, although it is newer and less mature than the other three. This matters to technical decision-makers because it reduces vendor lock-in. By using the OTel profiling specification, your profiling data can be transported via the OpenTelemetry Protocol (OTLP) and consumed by any backend that supports the standard, whether that is Grafana Cloud, Honeycomb, or an internal Pyroscope instance.

Leveraging eBPF for Zero-Instrumentation Profiling

eBPF (Extended Berkeley Packet Filter) has revolutionized observability by allowing us to run sandboxed programs in the Linux kernel without changing kernel source code or loading modules. For Go developers, the primary advantage of eBPF-based profiling is that it is out-of-process and zero-instrumentation.

Historically, to profile a Go app, you had to import net/http/pprof and expose an endpoint. While effective, this requires code changes and creates a dependency on the application runtime's health. If the application is under heavy lock contention or its scheduler is overwhelmed, the pprof endpoint itself might become unresponsive.

eBPF avoids this by sampling from the kernel level. The eBPF agent (like the Pyroscope agent or Grafana Alloy) attaches a program to a timer-driven perf event that fires on each CPU at a fixed rate, looks at the process executing at that moment, and walks its stack. Because it operates at the kernel level, it can see across the entire system, including CGO calls and kernel-space execution, which Go's built-in pprof does not fully capture.

The Architecture: Go, OTel, and Pyroscope

To implement this in a production Go environment, we typically look at a three-tier architecture:

The Agent (eBPF): A daemon (usually a DaemonSet in Kubernetes) that uses eBPF to sample the stack traces of all running containers. It attaches metadata (pod name, namespace, container ID) as OTel resource attributes.
The Collector: An OpenTelemetry Collector that receives profiling data via OTLP. It can process, filter, and batch the data before sending it to the storage backend.
The Backend (Pyroscope): A specialized database designed for storing and querying multi-dimensional profiling data. Pyroscope stores stacks as merged 'trees' instead of flat lists, so stacks that share a common prefix share storage, which together with compression lets it hold massive amounts of stack trace data efficiently.
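To illustrate why a tree is more compact than a flat list of stacks, here is a small sketch of merging stack samples into a call tree. It is not Pyroscope's actual storage format; the node type and its methods are hypothetical names used only for this example.

```go
package main

import "fmt"

// node is one frame in a merged call tree; samples counts how many captured
// stacks passed through this frame.
type node struct {
	samples  int
	children map[string]*node
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert merges one stack (root frame first) into the tree, so stacks that
// share a prefix such as main -> handle reuse the same nodes.
func (n *node) insert(stack []string) {
	n.samples++
	cur := n
	for _, frame := range stack {
		child, ok := cur.children[frame]
		if !ok {
			child = newNode()
			cur.children[frame] = child
		}
		child.samples++
		cur = child
	}
}

// size counts the frame nodes stored below n.
func (n *node) size() int {
	total := 0
	for _, c := range n.children {
		total += 1 + c.size()
	}
	return total
}

func main() {
	root := newNode()
	root.insert([]string{"main", "handle", "json.Unmarshal"})
	root.insert([]string{"main", "handle", "json.Unmarshal"})
	root.insert([]string{"main", "handle", "runtime.mallocgc"})
	fmt.Println(root.samples, root.size()) // prints "3 4"
}
```

Three samples containing nine frames in total collapse into four nodes, and the saving grows as more samples share the same call paths.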

Practical Implementation Strategy

When deploying this for Go microservices, a robust approach is to run Grafana Alloy (the successor to Grafana Agent) in eBPF mode. Here is how you would typically configure your Go environment to be 'profile-ready.'

1. Preparing the Go Binary

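While eBPF doesn't require code changes, it does require symbols. If you strip your Go binaries to reduce size (go build -ldflags="-s -w"), the ELF symbol table and DWARF debug data are removed, and a profiler that depends on them will see memory addresses but won't be able to map them back to function names like main.processOrder. Some profilers can fall back to the function table that the Go runtime keeps for its own stack traces, but that behavior varies by agent and version.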

Recommendation: Keep your symbols in production. The cost is typically a few megabytes of disk space, and in return every frame in the flame graph resolves to a readable function name.
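One way to audit a build artifact is to check whether these tables are present. The sketch below uses the standard debug/elf package, so it applies to Linux ELF binaries only. The names symbolInfo and inspect are hypothetical, and the .gopclntab section name is an assumption that holds for typical Go builds but may differ under some link modes.

```go
package main

import (
	"debug/elf"
	"errors"
	"fmt"
	"os"
)

// symbolInfo summarizes what a symbolizer could find in an ELF binary.
type symbolInfo struct {
	HasSymtab  bool // ELF symbol table, omitted when linking with -ldflags="-s"
	HasPclntab bool // Go's own function table section (assumed name .gopclntab)
}

// inspect opens the ELF file at path and reports which symbol sources exist.
func inspect(path string) (symbolInfo, error) {
	f, err := elf.Open(path)
	if err != nil {
		return symbolInfo{}, err
	}
	defer f.Close()

	var info symbolInfo
	syms, err := f.Symbols()
	switch {
	case err == nil:
		info.HasSymtab = len(syms) > 0
	case errors.Is(err, elf.ErrNoSymbols):
		// The binary was stripped; leave HasSymtab false.
	default:
		return symbolInfo{}, err
	}
	info.HasPclntab = f.Section(".gopclntab") != nil
	return info, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: symcheck <binary>")
		os.Exit(2)
	}
	info, err := inspect(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "inspect failed:", err)
		os.Exit(1)
	}
	fmt.Printf("symtab=%v gopclntab=%v\n", info.HasSymtab, info.HasPclntab)
}
```

A check like this could run as a step in the build pipeline and fail the build when the symbol table is unexpectedly missing.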

2. Configuring the eBPF Profiler

In a Kubernetes environment, you would deploy the agent with permissions to access the host's eBPF subsystem. The configuration (using Alloy's syntax) looks something like this:

// Assumes a discovery.relabel "kubernetes_pods" component is defined elsewhere.
pyroscope.ebpf "default" {
  forward_to = [pyroscope.write.backend.receiver]
  targets    = discovery.relabel.kubernetes_pods.output
}

pyroscope.write "backend" {
  endpoint {
    url = "http://pyroscope:4040"
  }
}

With pod discovery in place, this configuration starts collecting CPU profiles from the discovered pods. The agent handles the heavy lifting of mapping the Linux process IDs (PIDs) to your Kubernetes metadata, so that when you look at a flame graph, you can filter by labels such as service_name or deployment_env.

Identifying CPU Hotspots: A Real-World Example

Consider a Go microservice that processes high-volume telemetry data. During a peak load, the service's latency increases, and CPU usage spikes. A traditional trace might show that the ProcessData function is slow, but it won't show why.

When we open the Pyroscope flame graph for that service, we might see a wide bar for runtime.mallocgc. This is a classic Go 'gotcha.'

The Discovery

A flame graph shows stack depth on the Y-axis, while the width of each frame on the X-axis is proportional to the share of samples, and therefore CPU time, in which that frame appeared. The X-axis is not a timeline. If runtime.mallocgc (the Go runtime's heap allocator) is taking up 30% of the CPU, we know our issue isn't logic; it's allocation pressure.

By digging deeper into the flame graph, we might find that a specific JSON library is creating millions of short-lived interface{} objects during serialization.

The Fix: We switch from the standard encoding/json to a code-generating library like easyjson, which avoids reflection and reduces allocations, or use a sync.Pool to reuse buffers.

Without continuous profiling, identifying that the CPU spike was caused by allocation and GC pressure from JSON serialization could have required hours of manual log analysis and local benchmarking. With Pyroscope and eBPF, the answer is visible as soon as you open the flame graph.

Correlating Profiles with Traces

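The real 'North Star' of observability is correlation. OpenTelemetry supports this by giving traces and profiles shared identifiers and resource attributes, so that profile data can be tied back to the trace and span that were active when it was captured.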

When your Go service is instrumented with OTel Tracing, you can include the ProfileID in your trace metadata. Modern observability platforms allow you to click on a slow span in a trace and jump directly to the CPU profile for that specific time window and container.

This workflow transforms the debugging process:

Alert: P99 latency is high.
Trace: Find a specific slow request; see that it spent 500ms in a 'CalculateTax' span.
Profile: Click through to the flame graph for that 500ms window; see that the CPU was actually busy executing a regex compilation inside a loop.
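The regex finding in the last step typically looks like the first function below, and the fix like the second. The pattern and function names are hypothetical; only the shape of the problem is taken from the scenario above.

```go
package main

import (
	"fmt"
	"regexp"
)

// taxCodeExpr is a made-up pattern standing in for whatever the service validates.
const taxCodeExpr = `^[A-Z]{2}-\d{4}$`

// taxCodeRE is compiled once at package initialization and reused.
var taxCodeRE = regexp.MustCompile(taxCodeExpr)

// countValidSlow recompiles the pattern on every iteration, which is the kind
// of hotspot that shows up as a wide regexp compilation frame in a flame graph.
func countValidSlow(codes []string) int {
	n := 0
	for _, c := range codes {
		re := regexp.MustCompile(taxCodeExpr)
		if re.MatchString(c) {
			n++
		}
	}
	return n
}

// countValidFast returns the same result using the precompiled pattern.
func countValidFast(codes []string) int {
	n := 0
	for _, c := range codes {
		if taxCodeRE.MatchString(c) {
			n++
		}
	}
	return n
}

func main() {
	codes := []string{"US-1234", "bad", "DE-0001"}
	fmt.Println(countValidSlow(codes), countValidFast(codes)) // prints "2 2"
}
```

Both functions return the same count; the difference is only in how much CPU time is spent compiling, which is exactly what a trace alone cannot show.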

Performance Impact and Overhead

A common concern with continuous profiling is the 'observer effect'—the fear that measuring the system will slow it down.

eBPF-based profiling is remarkably efficient. Because it samples at a fixed rate (e.g., around 100Hz, or roughly 100 times per second per CPU core) rather than tracing every single function call, the overhead is typically reported as less than 1% CPU, with little additional memory used inside the application itself.
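The sampling arithmetic is simple enough to work through directly. The sketch below assumes a 100Hz rate, an 8-core host, and a 15-second collection window; all three numbers and both function names are illustrative assumptions.

```go
package main

import "fmt"

// maxSamples is the upper bound on stack samples collected in one window:
// one sample per tick per busy core.
func maxSamples(hz, cores, windowSeconds int) int {
	return hz * cores * windowSeconds
}

// coresBusy converts a sample count back into average cores in use:
// each sample represents 1/hz seconds of CPU time.
func coresBusy(samples, hz, windowSeconds int) float64 {
	return float64(samples) / float64(hz) / float64(windowSeconds)
}

func main() {
	fmt.Println(maxSamples(100, 8, 15))   // prints 12000
	fmt.Println(coresBusy(3750, 100, 15)) // prints 2.5
}
```

The bounded sample count is why the cost stays predictable: the agent handles at most a fixed number of stacks per window regardless of how many function calls the application makes.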

For Go applications, this is often lower than the overhead of traditional pprof scraping, because the agent does not run inside the Go runtime at all, and the data aggregation happens out-of-process in the agent rather than inside your application's heap.

Best Practices for Production Deployment

Tag Everything: Use OTel resource attributes to tag your profiles with version, region, and container_id. This allows you to perform 'Diff Profiling'—comparing the CPU profile of v1.2.0 against v1.1.0 to see if a deployment introduced a regression.
Retention Management: Profiling data is dense. Set a retention policy that balances cost with the need for historical analysis and, where your backend (Pyroscope/Grafana) supports it, downsample older data, for example after 7 days.
Focus on CPU First: While memory (heap) profiling is valuable, CPU profiling via eBPF provides the highest ROI for performance optimization in Go services, especially those handling network I/O or heavy computation.
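The idea behind diff profiling from the first practice above can be sketched as a comparison of per-function CPU shares between two versions. This is a simplified illustration, not how Pyroscope computes its diff view; the delta type, the diffProfiles function, and the percentages are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// delta is the change in a function's share of CPU samples between versions.
type delta struct {
	Func   string
	Change float64 // percentage points; positive means the new version uses more
}

// diffProfiles compares two maps of function name to percent of samples and
// returns the changes sorted with the largest regression first.
func diffProfiles(base, head map[string]float64) []delta {
	seen := map[string]bool{}
	var out []delta
	for fn, share := range head {
		out = append(out, delta{fn, share - base[fn]})
		seen[fn] = true
	}
	for fn, share := range base {
		if !seen[fn] {
			out = append(out, delta{fn, -share}) // function disappeared in head
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Change != out[j].Change {
			return out[i].Change > out[j].Change
		}
		return out[i].Func < out[j].Func
	})
	return out
}

func main() {
	v110 := map[string]float64{"json.Unmarshal": 10, "runtime.mallocgc": 5}
	v120 := map[string]float64{"json.Unmarshal": 10, "runtime.mallocgc": 30, "regexp.Compile": 8}
	for _, d := range diffProfiles(v110, v120) {
		fmt.Printf("%-18s %+.0f\n", d.Func, d.Change)
	}
}
```

Tagging profiles with the version attribute is what makes the two input sets selectable in the first place.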

Conclusion: Your Action Plan

Implementing continuous profiling with OpenTelemetry and eBPF is no longer a luxury reserved for FAANG-scale companies. It is a necessary component of a mature observability stack. To get started:

Audit your build pipeline: Ensure you aren't stripping debug symbols from your Go binaries.
Deploy an OTel-compatible agent: Use Grafana Alloy or the Pyroscope agent in a staging environment to measure the baseline overhead.
Integrate with your UI: Connect your profiling backend to your existing dashboards so developers can view flame graphs alongside their metrics and traces.

By moving from reactive pprof captures to continuous eBPF-based profiling, you stop guessing why your Go services are slow and start fixing the specific lines of code that are costing you money and performance.