Observability is the difference between finding the cause of a production incident in 10 minutes and spending three hours reading through log files. This is how we instrument Go SaaS APIs with OpenTelemetry to get the distributed tracing, metrics, and structured logs that actually matter in production.
The real difference between good and bad observability does not show during normal operation. It shows at 2 a.m., when your API is slow for exactly one tenant and you need to find the cause before a high-paying customer cancels.
Why is observability different from logging?
Logging answers what happened. Observability answers why it happened and where.
A structured log entry tells you that a request failed with a 500 error. Distributed tracing shows you the sequence of operations that led to that failure: the API handler called the database, the database query took 8 seconds, that query was triggered by a webhook from an external payment processor that sent an unusual payload. The trace connects these events across service boundaries and across time.
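To make the cross-service link concrete: over HTTP, the connection is usually carried by the W3C Trace Context `traceparent` header, which OpenTelemetry's TraceContext propagator reads and writes. The sketch below is a standard-library-only illustration of what that header contains, not the OpenTelemetry implementation. `TraceParent` and `parseTraceParent` are hypothetical names, and only version `00` of the header is handled.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
// The type name is illustrative, not part of any library.
type TraceParent struct {
	TraceID string // 32 hex characters, shared by every span in the trace
	SpanID  string // 16 hex characters, the caller's span (the parent)
	Sampled bool   // lowest bit of the trace-flags byte
}

// parseTraceParent splits "00-<trace-id>-<parent-id>-<flags>" into its parts.
func parseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("traceparent: unsupported format")
	}
	if len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return TraceParent{}, errors.New("traceparent: wrong field length")
	}
	for _, p := range parts[1:] {
		if _, err := hex.DecodeString(p); err != nil {
			return TraceParent{}, errors.New("traceparent: field is not hex")
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return TraceParent{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01,
	}, nil
}

func main() {
	// Example value taken from the W3C Trace Context specification.
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("trace=%s parent=%s sampled=%t\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

Every service that receives this header starts its spans under the same trace ID, which is what lets a backend stitch the API handler, the database call, and the webhook into one trace.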
For SaaS platforms in Lebanon and the MENA region, where backend services often integrate with regional payment processors, shipping APIs, and SMS gateways, distributed tracing is what makes integration failures debuggable without manually correlating logs across services.
OpenTelemetry is the open standard for instrumenting distributed systems. It defines a common format for traces, metrics, and logs, with exporters for every major observability backend: Grafana, Datadog, New Relic, Honeycomb, Jaeger, and others. Instrumenting with OpenTelemetry means your instrumentation is portable across backends without code changes.
Setting up OpenTelemetry in a Go service
Install the required packages:
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk/trace
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
Initialize the tracer provider on startup:
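// Assumes these import aliases: sdktrace "go.opentelemetry.io/otel/sdk/trace" and
// semconv "go.opentelemetry.io/otel/semconv/v1.26.0" (any recent semconv version).
func initTracer(ctx context.Context, serviceName string) (func(), error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(os.Getenv("OTEL_EXPORTER_ENDPOINT")), // host:port, not a URL
		otlptracegrpc.WithInsecure(), // plaintext gRPC; use TLS outside a trusted network
	)
	if err != nil {
		return nil, err
	}
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(serviceName),
			semconv.ServiceVersion(os.Getenv("APP_VERSION")),
			attribute.String("deployment.environment", os.Getenv("ENVIRONMENT")),
		),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), // sample 10%
	)
	otel.SetTracerProvider(tp)
	// Needed so trace context is read from and written to request headers.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return func() {
		// The startup ctx may already be canceled at shutdown, so use a fresh one.
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = tp.Shutdown(shutdownCtx) // flushes batched spans
	}, nil
}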
Wrap the HTTP server with OpenTelemetry middleware:
handler := otelhttp.NewHandler(
	mux,
	"api-server",
	otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
)
This single middleware call instruments every HTTP handler with a span that captures method, route, status code, and duration automatically.
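To show what such a wrapper records, here is a standard-library-only sketch of the same idea. It is not otelhttp's implementation and creates no spans. `requestRecord`, `observe`, and `serveOnce` are hypothetical names, and the record stands in for the attributes a request span would carry.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// requestRecord is a hypothetical stand-in for the attributes a request span carries.
type requestRecord struct {
	Method   string
	Path     string
	Status   int
	Duration time.Duration
}

// statusWriter remembers the status code the wrapped handler writes.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// observe wraps next and hands one record per request to sink.
func observe(next http.Handler, sink func(requestRecord)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		sink(requestRecord{
			Method:   r.Method,
			Path:     r.URL.Path,
			Status:   sw.status,
			Duration: time.Since(start),
		})
	})
}

// serveOnce sends one in-memory request through the wrapped mux and returns the record.
func serveOnce(method, path string) requestRecord {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/orders", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "database timeout", http.StatusInternalServerError)
	})
	var rec requestRecord
	handler := observe(mux, func(r requestRecord) { rec = r })
	handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(method, path, nil))
	return rec
}

func main() {
	rec := serveOnce(http.MethodGet, "/api/v1/orders")
	fmt.Println(rec.Method, rec.Path, rec.Status)
}
```

Because the wrapper sits outside the mux, handlers need no changes to be measured. That is the same reason a single `otelhttp.NewHandler` call covers every route.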
Adding spans to critical operations
The automatic middleware creates a span per HTTP request. To understand what happens inside a request, add child spans for database queries, external API calls, and business logic operations:
func (s *Service) GetTenantOrders(ctx context.Context, tenantID string) ([]Order, error) {
	ctx, span := otel.Tracer("order-service").Start(ctx, "GetTenantOrders")
	defer span.End()
	span.SetAttributes(
		attribute.String("tenant.id", tenantID),
	)
	orders, err := s.db.QueryOrders(ctx, tenantID)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, err
	}
	span.SetAttributes(attribute.Int("orders.count", len(orders)))
	return orders, nil
}
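For database queries, the pgx PostgreSQL driver (v5) exposes a tracer hook, and the third-party otelpgx package (github.com/exaring/otelpgx) implements it for OpenTelemetry. The tracer is set on the connection config inside the pool config:
poolConfig, err := pgxpool.ParseConfig(databaseURL)
if err != nil {
	return err
}
poolConfig.ConnConfig.Tracer = otelpgx.NewTracer()
db, err := pgxpool.NewWithConfig(ctx, poolConfig) // check err before using db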
With this configuration, every database query generates a child span showing the SQL text, duration, and number of rows returned. The span does not say why a query was slow, but a query that spent its time waiting on a row lock shows up as an unusually long span. Slow queries become immediately visible in the trace without any additional instrumentation.
Metrics: what to measure
Distributed traces explain individual requests. Metrics tell you about the health of the system over time. OpenTelemetry supports both.
For a Go SaaS API, the minimum metrics to instrument:
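var (
	requestDuration     metric.Float64Histogram
	activeRequests      metric.Int64UpDownCounter
	dbQueryDuration     metric.Float64Histogram
	tenantJobQueueDepth metric.Int64ObservableGauge
)

// initMetrics creates the instruments; metric is go.opentelemetry.io/otel/metric.
func initMetrics() (err error) {
	meter := otel.Meter("api-server")
	if requestDuration, err = meter.Float64Histogram(
		"api.request.duration",
		metric.WithDescription("HTTP request duration in seconds"),
		metric.WithUnit("s"),
	); err != nil {
		return err
	}
	if activeRequests, err = meter.Int64UpDownCounter("api.requests.active"); err != nil {
		return err
	}
	if dbQueryDuration, err = meter.Float64Histogram("db.query.duration", metric.WithUnit("s")); err != nil {
		return err
	}
	// An observable gauge reports its value through a callback, registered
	// with meter.RegisterCallback or metric.WithInt64Callback.
	tenantJobQueueDepth, err = meter.Int64ObservableGauge("queue.depth")
	return err
}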
The most important dimensions to add to every metric are tenant_id and route. This makes it possible to answer questions like: is this p99 latency spike coming from one tenant or from all tenants? Is the slow endpoint /api/v1/orders or /api/v1/reports?
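The effect of the tenant dimension can be shown with a small calculation. The sketch below uses made-up numbers and a plain nearest-rank percentile. In practice the observability backend does this grouping on the `tenant_id` attribute, so `sample`, `p99ByTenant`, and `demoSamples` are illustrative names only.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sample is one request-duration measurement tagged with its tenant.
type sample struct {
	TenantID string
	Seconds  float64
}

// p99 returns the 99th percentile using the nearest-rank method.
func p99(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(0.99 * float64(len(sorted))))
	return sorted[rank-1]
}

// p99ByTenant groups samples by tenant before taking the percentile.
func p99ByTenant(samples []sample) map[string]float64 {
	grouped := make(map[string][]float64)
	for _, s := range samples {
		grouped[s.TenantID] = append(grouped[s.TenantID], s.Seconds)
	}
	out := make(map[string]float64, len(grouped))
	for tenant, values := range grouped {
		out[tenant] = p99(values)
	}
	return out
}

// demoSamples builds made-up data: one tenant is fast, the other is slow.
func demoSamples() []sample {
	var samples []sample
	for i := 0; i < 95; i++ {
		samples = append(samples, sample{TenantID: "tenant-a", Seconds: 0.05})
	}
	for i := 0; i < 5; i++ {
		samples = append(samples, sample{TenantID: "tenant-b", Seconds: 3.0})
	}
	return samples
}

func main() {
	samples := demoSamples()
	all := make([]float64, 0, len(samples))
	for _, s := range samples {
		all = append(all, s.Seconds)
	}
	fmt.Println("overall p99:", p99(all))
	for tenant, v := range p99ByTenant(samples) {
		fmt.Println(tenant, "p99:", v)
	}
}
```

With this data the overall p99 is 3 seconds, which looks like a system-wide problem, while the per-tenant view shows that tenant-a is unaffected and every slow request belongs to tenant-b.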
Structured logging with trace correlation
Structured logs become significantly more useful when they include the trace ID and span ID from the active OpenTelemetry trace. This allows you to jump from a log line to the full trace in your observability backend with a single click.
With slog (Go 1.21+):
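// trace here is the API package go.opentelemetry.io/otel/trace, not the SDK.
func LogWithTrace(ctx context.Context, msg string, attrs ...slog.Attr) {
	span := trace.SpanFromContext(ctx)
	sc := span.SpanContext()
	var allAttrs []slog.Attr
	if sc.IsValid() {
		// Attach IDs only when a real span is active; otherwise they are all zeros.
		allAttrs = append(allAttrs,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	allAttrs = append(allAttrs, attrs...)
	slog.LogAttrs(ctx, slog.LevelInfo, msg, allAttrs...)
}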
When a tenant reports an issue at a specific time, you can filter logs by tenant_id and time range, find the failing request, copy the trace_id, and jump directly to the distributed trace that shows exactly what happened inside that request.
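That lookup is normally a query in the log backend, but the logic is simple enough to sketch with the standard library. The sketch assumes JSON log lines with the `time`, `level`, and `msg` keys that slog's JSON handler writes, plus `trace_id` and an assumed `tenant_id` attribute. The sample log lines, the `ERROR` level filter, and all function names are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// logLine covers the keys slog's JSON handler writes ("time", "level", "msg")
// plus trace_id. tenant_id is an assumed attribute name.
type logLine struct {
	Time     time.Time `json:"time"`
	Level    string    `json:"level"`
	Msg      string    `json:"msg"`
	TenantID string    `json:"tenant_id"`
	TraceID  string    `json:"trace_id"`
}

// traceIDsFor returns the distinct trace IDs logged at the given level for
// one tenant inside the window [from, to].
func traceIDsFor(logs, tenantID, level string, from, to time.Time) []string {
	var ids []string
	seen := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		var line logLine
		if err := json.Unmarshal(scanner.Bytes(), &line); err != nil {
			continue // skip anything that is not a JSON log line
		}
		if line.TenantID != tenantID || line.Level != level || line.TraceID == "" {
			continue
		}
		if line.Time.Before(from) || line.Time.After(to) {
			continue
		}
		if !seen[line.TraceID] {
			seen[line.TraceID] = true
			ids = append(ids, line.TraceID)
		}
	}
	return ids
}

// demoLogs is made-up sample data.
const demoLogs = `{"time":"2024-05-01T02:00:10Z","level":"INFO","msg":"orders listed","tenant_id":"tenant-a","trace_id":"aaaa0000000000000000000000000001"}
{"time":"2024-05-01T02:01:30Z","level":"ERROR","msg":"query timeout","tenant_id":"tenant-b","trace_id":"bbbb0000000000000000000000000002"}
{"time":"2024-05-01T02:01:31Z","level":"ERROR","msg":"request failed","tenant_id":"tenant-b","trace_id":"bbbb0000000000000000000000000002"}
{"time":"2024-05-01T03:30:00Z","level":"ERROR","msg":"query timeout","tenant_id":"tenant-b","trace_id":"cccc0000000000000000000000000003"}`

// demoLookup searches for tenant-b errors between 02:00 and 02:05 UTC.
func demoLookup() []string {
	from := time.Date(2024, 5, 1, 2, 0, 0, 0, time.UTC)
	to := from.Add(5 * time.Minute)
	return traceIDsFor(demoLogs, "tenant-b", "ERROR", from, to)
}

func main() {
	fmt.Println(demoLookup())
}
```

Two of the matching lines belong to the same request, so the lookup returns a single trace ID to paste into the tracing backend.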
Sampling strategy: not every trace needs to be recorded
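Recording and storing every trace for a SaaS backend that handles thousands of requests per minute is expensive. A 10% sampling rate is a common starting point: record one in ten traces, chosen at random when the trace starts.
For SaaS platforms, add tail-based sampling rules that override the default rate. Status code and duration are only known once a request has finished, so these rules cannot run in the SDK's head sampler; they are evaluated on completed traces, for example in the OpenTelemetry Collector's tail sampling processor, which must receive the unsampled trace stream to apply them: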
- Always sample if the response status code is 4xx or 5xx
- Always sample if request duration exceeds 2 seconds
- Always sample traces for any tenant_id that is named in an active alert
- Sample error-free fast requests at 1% to reduce storage costs
This keeps error traces fully visible while cutting storage for healthy traffic roughly tenfold compared with the 10% default.
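The four rules above reduce to a short decision function. The sketch below is a standard-library-only illustration of that decision, evaluated against a finished trace because the rules depend on the outcome of the request. It is not the configuration of any collector or backend, and `traceSummary`, `policy`, and `defaultPolicy` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// traceSummary is a hypothetical view of a finished trace: the fields the
// rules need are only known once the request is over.
type traceSummary struct {
	StatusCode int
	Duration   time.Duration
	TenantID   string
}

// policy holds the thresholds from the rules above.
type policy struct {
	SlowThreshold time.Duration   // always keep traces slower than this
	AlertTenants  map[string]bool // tenants currently named in an alert
	BaselineRate  float64         // fraction of healthy, fast traces to keep
}

// keep reports whether a finished trace should be stored. roll is a uniform
// random number in [0, 1) supplied by the caller, which keeps the decision
// deterministic under test.
func (p policy) keep(t traceSummary, roll float64) bool {
	switch {
	case t.StatusCode >= 400:
		return true
	case t.Duration > p.SlowThreshold:
		return true
	case p.AlertTenants[t.TenantID]:
		return true
	default:
		return roll < p.BaselineRate
	}
}

// defaultPolicy mirrors the numbers in the text: 2 seconds and 1%.
func defaultPolicy() policy {
	return policy{
		SlowThreshold: 2 * time.Second,
		AlertTenants:  map[string]bool{"tenant-b": true},
		BaselineRate:  0.01,
	}
}

func main() {
	p := defaultPolicy()
	traces := []traceSummary{
		{StatusCode: 500, Duration: 120 * time.Millisecond, TenantID: "tenant-a"},
		{StatusCode: 200, Duration: 3 * time.Second, TenantID: "tenant-a"},
		{StatusCode: 200, Duration: 80 * time.Millisecond, TenantID: "tenant-b"},
		{StatusCode: 200, Duration: 80 * time.Millisecond, TenantID: "tenant-a"},
	}
	for _, t := range traces {
		// 0.5 stands in for rand.Float64() so the output is repeatable.
		fmt.Printf("%+v keep=%t\n", t, p.keep(t, 0.5))
	}
}
```

Passing the random roll in as a parameter is a design choice for testability; a real sampler would draw it itself or derive it from the trace ID so that every service reaches the same decision.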
Alerting on trace data
Grafana Tempo, Honeycomb, and Datadog all support querying trace data for alerts. The queries that matter most:
- p99 latency by route exceeds 2 seconds for 5 consecutive minutes
- Error rate by tenant exceeds 5% in the last 10 minutes
- Database query duration p99 exceeds 500ms
- Any span with error = true in the payment_processor service in the last 5 minutes
For MENA SaaS platforms integrating with regional payment gateways and government-API services, payment processor span monitoring catches integration failures before they affect revenue.
Key lessons from production
Instrument database queries with trace IDs from day one. The query that looks fast in development and slow in production is explained by the trace, not the log.
Add tenant_id to every span and metric. Without it, aggregate latency numbers mask per-tenant issues, and you will spend hours discovering that a single large tenant is responsible for the apparent system-wide slowdown.
Trace correlation in logs is worth the 10 minutes of setup. The ability to jump from a log line to the full trace cuts incident investigation time in half.
Start with 10% sampling and tune down once you understand which traces are low-value. Error traces should always be sampled at 100% regardless of volume.
Not sure where to start?
Voxire builds production-grade SaaS backends with full observability instrumentation for teams in Lebanon and the MENA region. If you are building observability into an existing Go service or starting a new project that needs production-grade instrumentation from day one, reach out.
https://voxire.com/get-a-quote/



