7. Observability for Production
Make systems diagnosable: structured logs, RED metrics, traces, and safe profiling in prod.
Question: What are the "three pillars of observability," and how do you implement them in a Go service?
Answer: The three pillars are logs, metrics, and traces.
Logs: Detailed records of specific events. Implemented with a structured logging library (e.g., log/slog, zerolog) that outputs JSON.
Metrics: Aggregatable numerical data about the system's health (e.g., request rates, error rates, durations). Implemented with the Prometheus Go client (prometheus/client_golang).
Traces: Show the end-to-end journey of a request through a distributed system. Implemented using the OpenTelemetry SDK.
Explanation: These three pillars provide a complete picture of system behavior. Metrics tell you that a problem is occurring (e.g., p99 latency is high). Traces tell you where in the system the problem is (e.g., a specific downstream service call is slow). Logs provide the detailed, low-level context to understand why it happened. A request ID should be present in all three to correlate them.
// OpenTelemetry Tracing Example
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// Named tracer for this service; spans it starts are attributed to "my-service".
var tracer = otel.Tracer("my-service")

func Checkout(ctx context.Context, userID string) {
    // Start a child span of whatever span is already carried in ctx.
    ctx, span := tracer.Start(ctx, "checkout")
    defer span.End()
    span.SetAttributes(attribute.String("user.id", userID))
    // ... business logic ...
}
Question: How do you safely use pprof and continuous profiling in production?
Answer: Expose pprof on a protected admin port or behind auth; never on a public interface. For low-overhead always-on profiling, use continuous profilers (e.g., Pyroscope, Parca).
Explanation: pprof reveals internal state and can be heavy; secure endpoints and sample conservatively.
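A minimal sketch, assuming a single binary: the public API listens on :8080 without any pprof handlers, while pprof is mounted on a loopback-only admin listener reached via a port-forward, SSH tunnel, or authenticated proxy (ports and paths are illustrative).
// pprof on a separate, non-public listener (sketch)
import (
    "log"
    "net/http"
    "net/http/pprof"
)

func main() {
    // Public API server: no pprof handlers registered here.
    api := http.NewServeMux()
    api.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    go func() { log.Fatal(http.ListenAndServe(":8080", api)) }()

    // Admin server bound to localhost only.
    admin := http.NewServeMux()
    admin.HandleFunc("/debug/pprof/", pprof.Index)
    admin.HandleFunc("/debug/pprof/profile", pprof.Profile)
    admin.HandleFunc("/debug/pprof/trace", pprof.Trace)
    log.Fatal(http.ListenAndServe("127.0.0.1:6060", admin))
}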
Question: What RED metrics should every service expose?
Answer: Rate (requests per second), Errors, and Duration (as a latency histogram). Segment by route/method/result; avoid high-cardinality labels.
Explanation: RED enables quick SLO-based alerting and capacity insights.
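A sketch of RED instrumentation with prometheus/client_golang; metric and label names are illustrative, not a required convention.
// RED metrics (sketch)
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Rate and Errors: one counter, with errors derived from code=~"5..".
    requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests.",
    }, []string{"route", "method", "code"}) // low-cardinality labels only

    // Duration: a histogram so quantiles can be computed server-side.
    requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    }, []string{"route", "method"})
)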
Question: How do you choose histogram buckets for latency SLOs?
Answer: Choose buckets around your SLO boundaries (e.g., 50/100/200/400ms for a 200ms p99), with extra resolution near the target.
Explanation: Proper buckets enable accurate percentiles without high memory; avoid overly granular buckets that explode cardinality.
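A sketch of bucket bounds tuned for a hypothetical 200ms p99 target, with extra resolution around the boundary (values in seconds; the metric name is illustrative).
// SLO-aligned histogram buckets (sketch)
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var checkoutLatency = promauto.NewHistogram(prometheus.HistogramOpts{
    Name: "checkout_duration_seconds",
    Help: "Checkout latency in seconds.",
    // Dense around the 0.2s SLO boundary, sparse in the tails.
    Buckets: []float64{0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.4, 0.8, 1.6},
})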
Question: How do you trace cross-service calls with OpenTelemetry?
Answer: Propagate context with the W3C traceparent/tracestate headers: inject/extract them on outgoing and incoming HTTP/gRPC calls, and create spans per significant operation.
Explanation: Consistent propagation yields end-to-end traces for latency root-cause.
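A sketch of manual W3C context propagation with the OpenTelemetry SDK; in practice the otelhttp/otelgrpc instrumentation wrappers do the inject/extract for you. The handler, URL, and tracer name are illustrative.
// W3C trace context propagation (sketch)
import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func init() {
    // Register the W3C traceparent/tracestate propagator globally.
    otel.SetTextMapPropagator(propagation.TraceContext{})
}

// Client side: inject the current span context into outgoing headers.
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    return http.DefaultClient.Do(req)
}

// Server side: extract the remote span context, then start a child span.
func handle(w http.ResponseWriter, r *http.Request) {
    ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
    ctx, span := otel.Tracer("my-service").Start(ctx, "handle-request")
    defer span.End()
    _ = ctx // ... handle the request using ctx ...
}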
Question: Histograms vs summaries vs counters — when to use which?
Answer: Use counters for monotonic events, gauges for instant values, and histograms for latency distributions. Avoid summaries in Prometheus unless you need client-side quantiles.
Explanation: Histograms enable server-side quantiles and exemplars; choose bucket bounds matching SLOs.
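A brief sketch contrasting how the three types are updated, using a hypothetical job worker (metric names are illustrative).
// Counter vs gauge vs histogram (sketch)
import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    jobsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "jobs_processed_total", Help: "Counter: monotonic event count.",
    })
    jobsInFlight = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "jobs_in_flight", Help: "Gauge: instantaneous value, goes up and down.",
    })
    jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "job_duration_seconds", Help: "Histogram: latency distribution.",
        Buckets: prometheus.DefBuckets,
    })
)

func process(job func()) {
    jobsInFlight.Inc()
    defer jobsInFlight.Dec()
    start := time.Now()
    job()
    jobDuration.Observe(time.Since(start).Seconds())
    jobsTotal.Inc()
}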
Question: How do you prevent high-cardinality blowups?
Answer: Limit label values, avoid user IDs in labels, cap unique dimensions, and sample logs/traces.
Explanation: High-cardinality metrics explode memory and CPU in the TSDB; prefer structured logs for rare, high-detail data.
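A sketch of bounding label cardinality in HTTP middleware: label by the route template, never by the raw path or a user ID. It reuses the requestDuration histogram from the RED sketch above; routePattern is a hypothetical value supplied by your router.
// Bounded-cardinality labels (sketch)
import (
    "net/http"
    "time"
)

func instrument(routePattern string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next.ServeHTTP(w, r)
        // Good: "/users/{id}" has a small, fixed set of values.
        // Bad:  r.URL.Path ("/users/12345") or a user ID is unbounded.
        requestDuration.WithLabelValues(routePattern, r.Method).
            Observe(time.Since(start).Seconds())
    })
}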
Question: What are best practices for structured logging?
Answer: Use consistent keys (e.g., trace_id, request_id, user_id), redact PII, include the error kind and cause, and prefer JSON output.
Explanation: Consistency enables reliable querying and correlation across systems.
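A sketch of structured JSON logging with log/slog, pulling the trace ID from the active OpenTelemetry span so logs correlate with traces; the handleOrder function and field values are illustrative.
// Structured logging with consistent keys (sketch)
import (
    "context"
    "log/slog"
    "os"

    "go.opentelemetry.io/otel/trace"
)

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

func handleOrder(ctx context.Context, requestID, orderID string, err error) {
    // Trace ID from the active span ties this log line to its trace.
    traceID := trace.SpanContextFromContext(ctx).TraceID().String()
    logger.ErrorContext(ctx, "order failed",
        slog.String("trace_id", traceID),
        slog.String("request_id", requestID),
        slog.String("order_id", orderID), // IDs only; no raw PII such as names or emails
        slog.String("error_kind", "payment_declined"),
        slog.Any("cause", err),
    )
}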