10. Observability & Diagnostics
Make behavior visible: structured logs, Prometheus metrics, tracing, and live profiling.
Question: What are the three pillars of observability, and how do you implement them in Go?
Answer: The three pillars are Logging, Metrics, and Tracing.
Logging: Use a structured logging library like slog (standard library), zerolog, or zap to emit logs as key-value pairs (JSON). This makes them machine-readable and easy to query.
Metrics: Use a library like prometheus/client_golang to instrument your code with counters, gauges, and histograms. Expose a /metrics endpoint for a Prometheus server to scrape.
Tracing: Use the OpenTelemetry SDK (go.opentelemetry.io/otel) to add distributed traces. A trace follows a single request as it flows through multiple services, which is invaluable for debugging latency in microservices.
Explanation: A robust observability setup is non-negotiable for production services. Logs are for specific events, metrics are for aggregatable data, and traces are for understanding the lifecycle of a request. You should always include a correlation ID (or trace ID) in your logs to tie them back to a specific request.
Question: How do you enable and use pprof to diagnose performance issues in a running service?
Answer: Enable pprof either by blank-importing net/http/pprof (which registers handlers on the default mux) or by explicitly registering its handlers on a custom mux. Expose it on an internal-only port.
Explanation: A blank import registers handlers under /debug/pprof/* on the default mux:
import _ "net/http/pprof"
// run an internal server: http.ListenAndServe(":6060", nil)
With a custom mux, register handlers explicitly:
mux.HandleFunc("/debug/pprof/", pprof.Index)
// ... other pprof handlers
Once enabled, use go tool pprof to connect and capture profiles:
CPU Profile: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 shows which functions are consuming the most CPU time.
Heap Profile: go tool pprof http://localhost:6060/debug/pprof/heap shows which functions are allocating the most memory.
Goroutine Profile: go tool pprof http://localhost:6060/debug/pprof/goroutine can help diagnose goroutine leaks by showing where goroutines are blocked.
pprof provides the evidence needed for data-driven optimization. Don't guess where bottlenecks are; profile first.
Question: How do you capture mutex/block profiles from a running service?
Answer: Expose /debug/pprof/mutex and /debug/pprof/block (after enabling their sampling rates) and inspect them via go tool pprof.
Explanation: These profiles identify lock contention and blocking hotspots that drive latency. Both are disabled by default; enable them with runtime.SetMutexProfileFraction and runtime.SetBlockProfileRate.
Question: What are metrics best practices (counters, gauges, histograms)?
Answer: Use counters for totals (monotonic), gauges for current values, and histograms/summaries for latency/size distributions.
Explanation: Prefer histograms with well-chosen buckets for SLOs. Name metrics with clear units and labels; avoid high-cardinality labels like user IDs.
Question: How do you propagate correlation/trace IDs in logs and metrics?
Answer: Extract IDs from incoming requests, put them into the request Context, and include them in logs/metrics/traces.
Explanation: Use structured logging with consistent keys (e.g., trace_id, request_id). Ensure downstream calls forward these IDs.
Question: When should you use log sampling?
Answer: Sample noisy logs in high-throughput paths to control cost while retaining visibility.
Explanation: Many logging libraries support sampling; ensure errors and warnings are exempt from aggressive sampling.
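To make the mechanism concrete, here is a hand-rolled 1-in-N sampler sketch (libraries like zerolog and zap ship their own samplers; this type and its Allow method are purely illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sampler lets through 1 of every n events, safely across goroutines.
type sampler struct {
	n     uint64
	count atomic.Uint64
}

// Allow reports whether this event should be logged: the 1st, (n+1)th,
// (2n+1)th... calls return true, the rest are dropped.
func (s *sampler) Allow() bool {
	return s.count.Add(1)%s.n == 1
}

func main() {
	s := &sampler{n: 10}
	kept := 0
	for i := 0; i < 100; i++ {
		if s.Allow() {
			kept++ // only every 10th noisy event survives
		}
	}
	fmt.Println(kept)
}
```

In practice you would gate only debug/info logs on Allow and always emit warnings and errors, per the exemption advice above.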