API Observability with OpenTelemetry in Go for Production SaaS

Observability is the difference between finding the cause of a production incident in 10 minutes and spending three hours reading through log files. This is how we instrument Go SaaS APIs with OpenTelemetry to get the distributed tracing, metrics, and structured logs that actually matter in production.

The real difference between good and bad observability does not show up during normal operation. It shows up at 2 a.m., when your API is slow for a single tenant and you need to find out why before a customer on a high-priced subscription cancels.

Why is observability different from logging?

Logging answers what happened. Observability answers why it happened and where.

A structured log entry tells you that a request failed with a 500 error. Distributed tracing shows you the sequence of operations that led to that failure: the API handler called the database, the database query took 8 seconds, that query was triggered by a webhook from an external payment processor that sent an unusual payload. The trace connects these events across service boundaries and across time.
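What lets a trace cross a service boundary is a small piece of context carried with each request. OpenTelemetry's default HTTP propagation follows the W3C Trace Context format, where a traceparent header carries the trace ID, the caller's span ID, and a sampled flag. The following is a minimal sketch of parsing and re-emitting that header using only the standard library; the type and function names are illustrative, it handles only version 00, and in a real service the OpenTelemetry propagators do this for you.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C "traceparent" header.
type TraceParent struct {
	TraceID string // 32 lowercase hex chars, shared by every span in the trace
	SpanID  string // 16 lowercase hex chars, the span that made the call
	Sampled bool   // lowest bit of the trace-flags byte
}

var traceParentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// ParseTraceParent validates and splits a traceparent header value.
func ParseTraceParent(header string) (TraceParent, error) {
	m := traceParentRe.FindStringSubmatch(header)
	if m == nil {
		return TraceParent{}, errors.New("malformed traceparent header")
	}
	if m[1] == strings.Repeat("0", 32) || m[2] == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("all-zero trace or span ID")
	}
	flags, err := strconv.ParseUint(m[3], 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{TraceID: m[1], SpanID: m[2], Sampled: flags&1 == 1}, nil
}

// Header renders the value to send on an outgoing request.
// Only the sampled bit is preserved; other flag bits are dropped in this sketch.
func (tp TraceParent) Header() string {
	flags := "00"
	if tp.Sampled {
		flags = "01"
	}
	return "00-" + tp.TraceID + "-" + tp.SpanID + "-" + flags
}

func main() {
	in, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	// The downstream call keeps the trace ID and substitutes this service's own span ID.
	out := TraceParent{TraceID: in.TraceID, SpanID: "b7ad6b7169203331", Sampled: in.Sampled}
	fmt.Println(out.Header())
}
```

Because every hop reuses the same trace ID, the backend can stitch the webhook, the handler, and the database query into one trace.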

For SaaS platforms in Lebanon and the MENA region, where backend services often integrate with regional payment processors, shipping APIs, and SMS gateways, distributed tracing is what makes integration failures debuggable without manually correlating logs across services.

OpenTelemetry is the open standard for instrumenting distributed systems. It defines a common format for traces, metrics, and logs, with exporters for every major observability backend: Grafana, Datadog, New Relic, Honeycomb, Jaeger, and others. Instrumenting with OpenTelemetry means your instrumentation is portable: switching backends is an exporter configuration change, not a change to the instrumentation code.

Setting up OpenTelemetry in a Go service

Install the required packages:

go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk/trace
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp

Initialize the tracer provider on startup:

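import (
    "context"
    "os"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // use the version matching your SDK release
)

// initTracer configures the global tracer provider and returns a shutdown function.
func initTracer(ctx context.Context, serviceName string) (func(), error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(os.Getenv("OTEL_EXPORTER_ENDPOINT")), // host:port, no scheme
        otlptracegrpc.WithInsecure(), // plaintext gRPC; use TLS outside a trusted network
    )
    if err != nil {
        return nil, err
    }

    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion(os.Getenv("APP_VERSION")),
            attribute.String("deployment.environment", os.Getenv("ENVIRONMENT")),
        ),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        // Sample 10% of new traces; spans with a parent follow the parent's decision.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
    )
    otel.SetTracerProvider(tp)

    // Flush with a fresh context: the startup ctx may already be cancelled at shutdown.
    return func() {
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        _ = tp.Shutdown(shutdownCtx)
    }, nil
}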

Wrap the HTTP server with OpenTelemetry middleware:

handler := otelhttp.NewHandler(
    mux,
    "api-server",
    otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
)

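This single middleware call wraps the whole mux, so every request gets a server span that captures method, status code, and duration automatically. Whether the span also carries the matched route depends on your router and otelhttp version; if it is missing, set it per handler.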
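To make concrete what a wrapping middleware can observe, here is a minimal standard-library sketch of the same pattern: it wraps a handler, records the status code through a small ResponseWriter wrapper, and times the call. This is not how otelhttp is implemented; Observe, RequestInfo, and observeOnce are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// RequestInfo is the per-request data a wrapping middleware can see.
type RequestInfo struct {
	Method   string
	Path     string
	Status   int
	Duration time.Duration
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Observe wraps next and calls report once per request after the handler returns.
func Observe(next http.Handler, report func(RequestInfo)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Handlers that never call WriteHeader implicitly send 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		report(RequestInfo{
			Method:   r.Method,
			Path:     r.URL.Path,
			Status:   rec.status,
			Duration: time.Since(start),
		})
	})
}

// observeOnce sends one in-memory request through the middleware and returns what it saw.
func observeOnce(method, path string, status int) RequestInfo {
	var got RequestInfo
	inner := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	})
	h := Observe(inner, func(info RequestInfo) { got = info })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(method, path, nil))
	return got
}

func main() {
	info := observeOnce("GET", "/api/v1/orders", 500)
	fmt.Println(info.Method, info.Path, info.Status)
}
```

Note that the wrapper sees only the raw URL path. The matched route pattern is known to the router, which is why route attributes sometimes need explicit tagging.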

Adding spans to critical operations

The automatic middleware creates a span per HTTP request. To understand what happens inside a request, add child spans for database queries, external API calls, and business logic operations:

func (s *Service) GetTenantOrders(ctx context.Context, tenantID string) ([]Order, error) {
    ctx, span := otel.Tracer("order-service").Start(ctx, "GetTenantOrders")
    defer span.End()

    span.SetAttributes(
        attribute.String("tenant.id", tenantID),
    )

    orders, err := s.db.QueryOrders(ctx, tenantID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    span.SetAttributes(attribute.Int("orders.count", len(orders)))
    return orders, nil
}

For database queries, pgx v5 exposes a tracer hook, and the otelpgx package (github.com/exaring/otelpgx) plugs OpenTelemetry into it:

cfg, err := pgxpool.ParseConfig(databaseURL)
if err != nil {
    return err
}
cfg.ConnConfig.Tracer = otelpgx.NewTracer() // the tracer hook lives on the connection config

db, err := pgxpool.NewWithConfig(ctx, cfg)
if err != nil {
    return err
}

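With this configuration, every database query generates a child span showing the SQL text and duration, with failures recorded as span errors and, depending on the otelpgx version, the affected-row count. Lock waits are not reported as such; a query blocked on a row lock simply shows up as a long span. Slow queries become immediately visible in the trace without any additional instrumentation.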

Metrics: what to measure

Distributed traces explain individual requests. Metrics tell you about the health of the system over time. OpenTelemetry supports both.

For a Go SaaS API, the minimum metrics to instrument:

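import "go.opentelemetry.io/otel/metric"

var (
    requestDuration     metric.Float64Histogram
    activeRequests      metric.Int64UpDownCounter
    dbQueryDuration     metric.Float64Histogram
    tenantJobQueueDepth metric.Int64ObservableGauge // values are reported from a registered callback
)

// initMetrics creates the instruments; each constructor returns an error that must be checked.
func initMetrics(meter metric.Meter) (err error) {
    if requestDuration, err = meter.Float64Histogram("api.request.duration",
        metric.WithDescription("HTTP request duration in seconds"),
        metric.WithUnit("s"),
    ); err != nil {
        return err
    }
    if activeRequests, err = meter.Int64UpDownCounter("api.requests.active"); err != nil {
        return err
    }
    if dbQueryDuration, err = meter.Float64Histogram("db.query.duration",
        metric.WithUnit("s"),
    ); err != nil {
        return err
    }
    tenantJobQueueDepth, err = meter.Int64ObservableGauge("queue.depth")
    return err
}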

The most important dimensions to add to every metric are tenant_id and route. This makes it possible to answer questions like: is this p99 latency spike coming from one tenant or from all tenants? Is the slow endpoint /api/v1/orders or /api/v1/reports?

Structured logging with trace correlation

Structured logs become significantly more useful when they include the trace ID and span ID from the active OpenTelemetry trace. This allows you to jump from a log line to the full trace in your observability backend with a single click.

With slog (Go 1.21+):

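// trace is go.opentelemetry.io/otel/trace.
func LogWithTrace(ctx context.Context, msg string, attrs ...slog.Attr) {
    allAttrs := make([]slog.Attr, 0, len(attrs)+2)

    // Only attach IDs when the context carries a valid span; otherwise they would be all zeros.
    if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
        allAttrs = append(allAttrs,
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        )
    }
    allAttrs = append(allAttrs, attrs...)

    slog.LogAttrs(ctx, slog.LevelInfo, msg, allAttrs...)
}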

When a tenant reports an issue at a specific time, you can filter logs by tenant_id and time range, find the failing request, copy the trace_id, and jump directly to the distributed trace that shows exactly what happened inside that request.

Sampling strategy: not every trace needs to be recorded

Recording and storing every trace for a SaaS backend that handles thousands of requests per minute is expensive. A 10% sampling rate is a common starting point: record roughly one in ten traces, with the decision derived from the trace ID.

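For SaaS platforms, add tail-based sampling rules that override the default rate. These rules depend on how the request turned out, so they have to be evaluated after the trace completes (typically in a collector that receives all spans) rather than by the SDK's head sampler, which decides before the status code or duration is known: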

Always sample if the response status code is 4xx or 5xx
Always sample if request duration exceeds 2 seconds
Always sample traces for a tenant_id that is named in an active alert
Sample error-free fast requests at 1% to reduce storage costs

This keeps error traces fully visible while cutting stored trace volume for healthy traffic by roughly 10x compared with a flat 10% rate.
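The four rules can be expressed as a single keep-or-drop decision made once the outcome of the request is known. The sketch below is a standard-library illustration of that decision logic, not a collector configuration; TraceSummary, Policy, and defaultPolicy are hypothetical names, and the thresholds mirror the rules listed above. The baseline decision is derived from the trace ID so that every evaluator reaches the same answer for the same trace.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"time"
)

// TraceSummary is what the sampler knows once a trace has completed.
type TraceSummary struct {
	TraceID    string // 32 hex characters
	StatusCode int
	Duration   time.Duration
	TenantID   string
}

// Policy holds the override rules and the baseline ratio for healthy traffic.
type Policy struct {
	SlowThreshold  time.Duration
	WatchedTenants map[string]bool // tenants named in an active alert
	BaselineRatio  float64         // fraction of healthy, fast traces to keep
}

// Keep applies the override rules first and falls back to the baseline ratio.
func (p Policy) Keep(t TraceSummary) bool {
	switch {
	case t.StatusCode >= 400:
		return true
	case t.Duration > p.SlowThreshold:
		return true
	case p.WatchedTenants[t.TenantID]:
		return true
	}
	return ratioKeep(t.TraceID, p.BaselineRatio)
}

// ratioKeep maps the low 8 bytes of the trace ID to a 63-bit number and
// keeps the trace if that number falls below ratio * 2^63.
func ratioKeep(traceID string, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	raw, err := hex.DecodeString(traceID)
	if err != nil || len(raw) != 16 {
		return false // malformed IDs are dropped in this sketch
	}
	v := binary.BigEndian.Uint64(raw[8:]) >> 1
	return float64(v) < ratio*float64(uint64(1)<<63)
}

// defaultPolicy mirrors the rules in the text: 2 s slow threshold, 1% baseline.
func defaultPolicy() Policy {
	return Policy{
		SlowThreshold:  2 * time.Second,
		WatchedTenants: map[string]bool{"tenant-a": true},
		BaselineRatio:  0.01,
	}
}

func main() {
	p := defaultPolicy()
	healthy := TraceSummary{
		TraceID:    "4bf92f3577b34da6a3ce929d0e0e4736",
		StatusCode: 200,
		Duration:   40 * time.Millisecond,
		TenantID:   "tenant-z",
	}
	failed := healthy
	failed.StatusCode = 502
	fmt.Println(p.Keep(healthy), p.Keep(failed))
}
```

The slow-request rule is why this cannot run at span start: the duration is only known when the request finishes.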

Alerting on trace data

Grafana Tempo, Honeycomb, and Datadog all support querying trace data for alerts. The queries that matter most:

p99 latency by route exceeds 2 seconds for 5 consecutive minutes
Error rate by tenant exceeds 5% in the last 10 minutes
Database query duration p99 exceeds 500ms
Any span with error = true in the payment_processor service in the last 5 minutes
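Each backend expresses these rules in its own query language, but the arithmetic behind them is the same. As an illustration only, the sketch below evaluates the latency and error-rate thresholds over one window of request records in plain Go. The Request type, the function names, and the sample data are hypothetical, and the "for 5 consecutive minutes" condition would be handled by evaluating successive windows.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Request is one finished request inside the evaluation window.
type Request struct {
	Tenant   string
	Route    string
	Duration time.Duration
	Failed   bool
}

// P99 returns the nearest-rank 99th percentile of the given durations.
func P99(durations []time.Duration) time.Duration {
	n := len(durations)
	if n == 0 {
		return 0
	}
	sorted := make([]time.Duration, n)
	copy(sorted, durations)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (99*n + 99) / 100 // integer form of ceil(0.99 * n)
	return sorted[rank-1]
}

// SlowRoutes returns the routes whose p99 latency exceeds limit, sorted by name.
func SlowRoutes(reqs []Request, limit time.Duration) []string {
	byRoute := map[string][]time.Duration{}
	for _, r := range reqs {
		byRoute[r.Route] = append(byRoute[r.Route], r.Duration)
	}
	var out []string
	for route, ds := range byRoute {
		if P99(ds) > limit {
			out = append(out, route)
		}
	}
	sort.Strings(out)
	return out
}

// FailingTenants returns the tenants whose error rate exceeds maxRate, sorted by name.
func FailingTenants(reqs []Request, maxRate float64) []string {
	total, failed := map[string]int{}, map[string]int{}
	for _, r := range reqs {
		total[r.Tenant]++
		if r.Failed {
			failed[r.Tenant]++
		}
	}
	var out []string
	for tenant, n := range total {
		if float64(failed[tenant])/float64(n) > maxRate {
			out = append(out, tenant)
		}
	}
	sort.Strings(out)
	return out
}

// sampleWindow builds made-up data: tenant-a is slow and failing, tenant-b is healthy.
func sampleWindow() []Request {
	var reqs []Request
	for i := 0; i < 10; i++ {
		reqs = append(reqs,
			Request{Tenant: "tenant-a", Route: "/api/v1/reports", Duration: 3 * time.Second, Failed: i < 2},
			Request{Tenant: "tenant-b", Route: "/api/v1/orders", Duration: 80 * time.Millisecond},
		)
	}
	return reqs
}

func main() {
	reqs := sampleWindow()
	fmt.Println("slow routes:", SlowRoutes(reqs, 2*time.Second))
	fmt.Println("failing tenants:", FailingTenants(reqs, 0.05))
}
```

Grouping by tenant before computing the rate is what keeps one failing tenant from being averaged away by healthy traffic from everyone else.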

For MENA SaaS platforms integrating with regional payment gateways and government-API services, payment processor span monitoring catches integration failures before they affect revenue.

Key lessons from production

Instrument database queries with trace IDs from day one. The query that looks fast in development and slow in production is explained by the trace, not the log.

Add tenant_id to every span and metric. Without it, aggregate latency numbers mask per-tenant issues, and you will spend hours discovering that a single large tenant is responsible for the apparent system-wide slowdown.

Trace correlation in logs is worth the 10 minutes of setup. The ability to jump from a log line to the full trace cuts incident investigation time in half.

Start with 10% sampling and tune down once you understand which traces are low-value. Error traces should always be sampled at 100% regardless of volume.
