Distributed Tracing

Distributed tracing is a technique used to track requests as they flow through various services in a microservices architecture or a distributed system. It helps provide visibility into how requests are processed, how services interact, and where bottlenecks or failures may occur.

Key Concepts

Trace: Represents the journey of a request as it moves through a system. A single trace is made up of multiple spans.
Span: Represents a single unit of work or operation within a service (e.g., a database query or an HTTP call). Contains metadata such as:
- Start time
- Duration
- Operation name
- Tags (e.g., status, error messages)
Trace Context: The metadata that links all spans together. It includes the trace ID, the current span ID (which becomes the parent span ID of any span created downstream), and flags such as whether the trace is sampled.
Parent-Child Relationships: Spans are often arranged in a tree structure where one span (the parent) triggers or calls other spans (the children), forming a hierarchy. This parent-child relationship represents how services interact during the lifecycle of a request.
Sampling: To limit overhead, tracing systems often record only a subset of requests rather than every one. The subset can be chosen at random (for example, a fixed percentage of requests) or targeted at requests of particular interest, such as those that fail or run slowly.

How Distributed Tracing Works

Request Initiation: When a request enters the system, a unique trace ID is generated to represent that request. Each service or microservice that handles this request adds a span to the trace.
Span Creation: Each service that processes the request creates its own span, logs its metadata (e.g., start time, end time, service name), and propagates the trace context (trace ID, span ID) to subsequent services or systems.
Context Propagation: As the request travels through various microservices, the trace context (trace ID, parent span ID) is passed along with the request, typically in HTTP headers or message metadata, so that each downstream service can attach its spans to the same trace.
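Finalization: Each span ends when its operation finishes, and every service exports its own finished spans to a tracing backend. The trace is complete once the root span ends (successfully or with an error); the backend then groups the spans by trace ID for aggregation and visualization.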

Correlation ID vs Trace ID

Correlation ID:
- A generic identifier used primarily for logs.
- Focused on identifying and tracking requests in logs.
Trace ID:
- A specific identifier used in distributed tracing systems.
- Focused on linking spans and representing a request's flow across services.

Complement Each Other

While the terms are related, they serve slightly different purposes. However, correlation IDs can complement trace IDs, particularly in systems that don’t yet have full tracing implemented.

Even with distributed tracing, logs from various services might not automatically include Trace IDs.
By adding a Correlation ID explicitly in the logs, you create a unified view of request flows across logs and traces.
Correlation IDs can serve custom tracking needs, such as connecting non-traceable events (e.g., asynchronous message processing) to the original request.
Trace data might not be available if the tracing infrastructure fails or spans are dropped due to sampling. Correlation IDs in logs ensure you still have a fallback for debugging.

Common Tools

Jaeger
OpenTelemetry
Grafana Tempo
AWS X-Ray
Google Cloud Trace

Use Cases

Performance Monitoring
Error Detection and Debugging
Service Dependency Mapping
Request Latency Breakdown
Root Cause Analysis

Example

Here is an example implementation that combines Correlation IDs with Trace IDs using OpenTelemetry in a Go application.

Wrap your HTTP handler in middleware that reuses the incoming Correlation ID, or generates a new one if the request does not carry one:

package main

import (
    "context"
    "log"
    "net/http"
    "time"

    "github.com/google/uuid"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
    "go.opentelemetry.io/otel/trace"
)

const correlationIDHeader = "X-Correlation-ID"

func initTracer() func() {

    exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
    if err != nil {
        log.Fatalf("failed to initialize stdout exporter: %v", err)
    }

    // Named res so the variable does not shadow the resource package.
    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceNameKey.String("example-service"),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
    )

    otel.SetTracerProvider(tp)

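    return func() {
        // Flush buffered spans, giving the exporter up to five seconds.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := tp.Shutdown(ctx); err != nil {
            log.Printf("failed to shut down tracer provider: %v", err)
        }
    }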
}

// correlationIDKey is an unexported context key type. Using a dedicated type
// instead of a plain string avoids collisions with keys set by other packages.
type correlationIDKey struct{}

func middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Reuse the caller's Correlation ID, or generate one if absent.
        correlationID := r.Header.Get(correlationIDHeader)
        if correlationID == "" {
            correlationID = uuid.New().String()
            r.Header.Set(correlationIDHeader, correlationID)
        }

        // Add Correlation ID to context
        ctx := context.WithValue(r.Context(), correlationIDKey{}, correlationID)

        // Start a span for this request; the returned context carries the span.
        tracer := otel.Tracer("example-tracer")
        ctx, span := tracer.Start(ctx, "request")
        defer span.End()

        // Record the Correlation ID on the span so traces can be searched by it.
        span.SetAttributes(
            attribute.String(correlationIDHeader, correlationID),
        )

        // Log both IDs so log lines and traces can be matched to each other.
        log.Printf("Correlation ID: %s, Trace ID: %s", correlationID, span.SpanContext().TraceID())

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

func handler(w http.ResponseWriter, r *http.Request) {
    // The comma-ok form avoids a panic if the handler runs without the middleware.
    correlationID, _ := r.Context().Value(correlationIDKey{}).(string)
    traceID := trace.SpanFromContext(r.Context()).SpanContext().TraceID().String()

    w.WriteHeader(http.StatusOK)
    w.Write([]byte("Correlation ID: " + correlationID + "\n"))
    w.Write([]byte("Trace ID: " + traceID + "\n"))
}

func main() {
    shutdown := initTracer()
    defer shutdown()

    mux := http.NewServeMux()
    mux.Handle("/", middleware(http.HandlerFunc(handler)))

    log.Println("Starting server on :8080")
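    if err := http.ListenAndServe(":8080", mux); err != nil {
        // log.Fatalf would exit immediately and skip the deferred shutdown,
        // so buffered spans would never be flushed.
        log.Printf("server failed: %v", err)
    }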
}

How It Works

Correlation ID Handling:
- The middleware checks for an X-Correlation-ID header on the incoming request.
- If missing, it generates a new UUID as the Correlation ID.
Context Propagation:
- The Correlation ID is added to the request context for downstream use.
Trace ID Integration:
- A new trace span is started for the request, and the Trace ID is logged alongside the Correlation ID.
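- The span is stored in the request context, so code further down the handler chain can retrieve it. Propagating it to downstream services is not automatic: it also requires a configured propagator and outbound calls that inject the trace context into their headers.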
Logging:
- Logs include both Correlation ID and Trace ID, making it easy to correlate between logs and traces.


Last updated 6 months ago