10. Observability & Diagnostics

Make behavior visible: structured logs, Prometheus metrics, tracing, and live profiling.

Question: What are the three pillars of observability, and how do you implement them in Go?

Answer: The three pillars are Logging, Metrics, and Tracing.

  • Logging: Use a structured logging library like slog (standard library), zerolog, or zap to emit logs as key-value pairs (JSON). This makes them machine-readable and easy to query.

  • Metrics: Use a library like prometheus/client_golang to instrument your code with counters, gauges, and histograms. Expose a /metrics endpoint for a Prometheus server to scrape.

  • Tracing: Use the OpenTelemetry SDK (go.opentelemetry.io/otel) to add distributed traces. Tracing follows a single request as it flows through multiple services, which is invaluable for debugging latency in microservices.

Explanation: A robust observability setup is non-negotiable for production services. Logs are for specific events, metrics are for aggregatable data, and traces are for understanding the lifecycle of a request. You should always include a correlation ID (or trace ID) in your logs to tie them back to a specific request.

Question: How do you enable and use pprof to diagnose performance issues in a running service?

Answer: Enable pprof either by blank‑importing net/http/pprof (registers handlers on the default mux) or by explicitly registering handlers on a custom mux. Expose it on an internal‑only port.

Explanation: Blank import registers under /debug/pprof/* on the default mux:

import _ "net/http/pprof"
// run an internal server bound to loopback: http.ListenAndServe("localhost:6060", nil)

With a custom mux, register handlers explicitly:

import "net/http/pprof"

mux.HandleFunc("/debug/pprof/", pprof.Index)
// ... plus the Cmdline, Profile, Symbol, and Trace handlers

Once enabled, use go tool pprof to connect and capture profiles:

  • CPU Profile: go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30" captures 30 seconds of samples and shows which functions are consuming the most CPU time (quote the URL so the shell does not interpret the ?).

  • Heap Profile: go tool pprof http://localhost:6060/debug/pprof/heap shows which functions are allocating the most memory.

  • Goroutine Profile: go tool pprof http://localhost:6060/debug/pprof/goroutine can help diagnose goroutine leaks by showing where they are blocked.

pprof provides the evidence needed for data-driven optimization. Don't guess where bottlenecks are; profile first.

Question: How do you capture mutex/block profiles from a running service?

Answer: Both profiles are off by default, so enable sampling first: runtime.SetMutexProfileFraction for mutex contention and runtime.SetBlockProfileRate for blocking events. Then fetch /debug/pprof/mutex and /debug/pprof/block and inspect via go tool pprof.

Explanation: Mutex profiles show where goroutines contend for locks; block profiles show where they wait on channels and other synchronization. Both are common drivers of tail latency that a CPU profile will not reveal.

Question: What are metrics best practices (counters, gauges, histograms)?

Answer: Use counters for totals (monotonic), gauges for current values, and histograms/summaries for latency/size distributions.

Explanation: Prefer histograms with well-chosen buckets for SLOs. Name metrics with clear units and labels; avoid high-cardinality labels like user IDs.

Question: How do you propagate correlation/trace IDs in logs and metrics?

Answer: Extract IDs from incoming requests, put them into Context, and include them in logs/metrics/traces.

Explanation: Use structured logging with consistent keys (e.g., trace_id, request_id). Ensure downstream calls forward these IDs.

Question: When should you use log sampling?

Answer: Sample noisy logs in high-throughput paths to control cost while retaining visibility.

Explanation: Many logging libraries support sampling; ensure errors and warnings are exempt from aggressive sampling.
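Libraries such as zerolog and zap ship samplers built in; as an illustration of the underlying idea, a hand-rolled 1-in-n sampler (the type and ratio are hypothetical, and error/warning paths would bypass it):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sampler admits 1 of every n calls; safe for concurrent use.
type sampler struct {
	n     uint64
	calls atomic.Uint64
}

// Allow reports whether this call should actually emit its log line.
func (s *sampler) Allow() bool {
	return (s.calls.Add(1)-1)%s.n == 0
}

func main() {
	s := &sampler{n: 10}
	kept := 0
	for i := 0; i < 100; i++ {
		if s.Allow() {
			kept++ // only these iterations would log
		}
	}
	fmt.Println("kept", kept, "of 100") // kept 10 of 100
}
```

Counter-based sampling keeps a deterministic fraction of lines; probabilistic or burst-then-throttle schemes (as in zap's sampler) trade determinism for better behavior under spiky load.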