Debugging Go Services in Production on AWS ECS: No SSH, No Panic

When a Go service misbehaves on ECS Fargate at 3am, you cannot SSH in. The container is ephemeral, the task may have already restarted, and the instance underneath is managed by AWS. The teams that debug production fast are the ones who built the right observability before they needed it. This is what that setup actually looks like for SaaS products we run in Lebanon and the MENA region.

Why ECS Fargate makes traditional debugging impossible

With EC2, a panicked engineer can SSH into the machine, attach strace, tail logs directly, or look at process state. With Fargate, there is no underlying machine you control. The container runs, does its work, and disappears. If it crashes, the task gets replaced by ECS automatically.

This is the right tradeoff for production SaaS: auto-recovery, no patching burden, no bastion host sprawl. But it requires you to have already instrumented your service correctly before anything goes wrong.

Layer 1: Structured logs with correlation IDs

The first thing every Go service needs on ECS is structured logging that ships to CloudWatch Logs. fmt.Println gets you nowhere in production.

We use zap for structured logging in Go:

import "go.uber.org/zap"

func NewLogger(env string) (*zap.Logger, error) {
  if env == "production" {
    return zap.NewProduction()
  }
  return zap.NewDevelopment()
}

zap.NewProduction() outputs JSON, which CloudWatch can parse and filter. Every log line includes a timestamp (ts), a level, a message (msg), and any fields you attach.

Correlation IDs are the critical piece. Every request that enters your system should get a unique trace ID that flows through every log line for that request:

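// contextKey is unexported so other packages cannot collide with our keys.
type contextKey string

const contextKeyTraceID contextKey = "trace_id"

func TraceMiddleware(next http.Handler) http.Handler {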
  return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    traceID := r.Header.Get("X-Trace-Id")
    if traceID == "" {
      traceID = uuid.New().String()
    }

    // Attach to response so callers can correlate
    w.Header().Set("X-Trace-Id", traceID)

    // Store in context
    ctx := context.WithValue(r.Context(), contextKeyTraceID, traceID)
    next.ServeHTTP(w, r.WithContext(ctx))
  })
}

func LoggerFromContext(ctx context.Context, base *zap.Logger) *zap.Logger {
  if traceID, ok := ctx.Value(contextKeyTraceID).(string); ok {
    return base.With(zap.String("trace_id", traceID))
  }
  return base
}

Now when a request fails, you search CloudWatch Logs for that trace ID and see every log line from every service that touched the request, provided each service forwards the X-Trace-Id header on its outbound calls.

Layer 2: CloudWatch Logs Insights queries

Once you have structured JSON logs in CloudWatch, Logs Insights lets you query them like a database. This is where production debugging actually happens.

Search for all error logs in the last 30 minutes (the time window is set in the Logs Insights time picker, not in the query itself):

fields @timestamp, @message, trace_id, workspace_id
| filter level = "error"
| sort @timestamp desc
| limit 50

Correlate a specific trace across services:

fields @timestamp, @logStream, @message
| filter trace_id = "5d3f2a1b-..."
| sort @timestamp asc

Find the slowest requests in the last hour:

fields @timestamp, trace_id, duration_ms, path
| filter ispresent(duration_ms)
| sort duration_ms desc
| limit 20

The query syntax is limited compared to SQL, but for the most common production debugging scenarios it is sufficient without any additional infrastructure.

Layer 3: ECS Exec for live container access

For cases where logs are not enough and you need to inspect live container state, AWS added ECS Exec. It is essentially SSH into a running Fargate container, routed through SSM.

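Enable it on the ECS service (or on the run-task call). The flag is not part of the task definition, and it only applies to tasks started after it is set: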

{
  "enableExecuteCommand": true
}

The task role (not the task execution role) needs SSM permissions:

{
  "Effect": "Allow",
  "Action": [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel"
  ],
  "Resource": "*"
}

Then connect to a running container:

aws ecs execute-command \
  --cluster production \
  --task arn:aws:ecs:eu-west-1:123456789:task/... \
  --container api \
  --interactive \
  --command "/bin/sh"

Inside the container you can inspect environment variables, check file system state, run curl against internal endpoints, or view process state. This is powerful but should be treated as an emergency tool, not a crutch for missing observability.

Layer 4: Panic recovery with structured logging

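Go services on ECS need explicit panic recovery. net/http recovers a panic in a handler on its own, but it only writes an unstructured, multi-line stack trace to stderr and drops the connection, and a panic in any other goroutine terminates the whole process. Either way, nothing in the logs parses as JSON or carries the trace ID.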

func RecoverMiddleware(logger *zap.Logger) func(http.Handler) http.Handler {
  return func(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
      defer func() {
        if rec := recover(); rec != nil {
          // debug.Stack (package runtime/debug) returns the full stack of this
          // goroutine; a fixed 4096-byte buffer can truncate deep stacks.
          stack := string(debug.Stack())

          log := LoggerFromContext(r.Context(), logger)
          log.Error("panic recovered",
            zap.Any("panic", rec),
            zap.String("stack", stack),
            zap.String("path", r.URL.Path),
          )

          http.Error(w, "internal server error", http.StatusInternalServerError)
        }
      }()
      next.ServeHTTP(w, r)
    })
  }
}

This ensures that even when a handler panics, you get a structured log entry with the stack trace and trace ID. In Logs Insights you can filter on msg = "panic recovered" and find every instance.
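The middleware only protects the goroutine that serves the request. A panic in a goroutine that a handler starts (a background job, a fan-out call) is not caught by it and still takes the process down. A minimal sketch of a wrapper for such goroutines, using only the standard library. SafeGo and its onPanic callback are hypothetical names for illustration:

```go
package main

import "runtime/debug"

// SafeGo runs fn in a new goroutine and reports any panic to onPanic, together
// with the stack of the panicking goroutine, instead of crashing the process.
func SafeGo(fn func(), onPanic func(recovered any, stack []byte)) {
	go func() {
		defer func() {
			if rec := recover(); rec != nil {
				onPanic(rec, debug.Stack())
			}
		}()
		fn()
	}()
}
```

In a service, onPanic would typically log "panic recovered" with the same fields as the middleware, so both kinds of panic show up under one query.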

Layer 5: Container health checks and restart signals

ECS uses health checks to decide when to replace a task. If your health check endpoint is too simple, a container that is functionally broken but technically alive stays in service and keeps receiving traffic.

A useful health check in Go validates internal state:

func (s *Server) HealthHandler(w http.ResponseWriter, r *http.Request) {
  // Check database connectivity
  if err := s.db.PingContext(r.Context()); err != nil {
    s.logger.Warn("health check: db ping failed", zap.Error(err))
    w.WriteHeader(http.StatusServiceUnavailable)
    json.NewEncoder(w).Encode(map[string]string{"status": "unhealthy", "reason": "db"})
    return
  }

  // Check Redis connectivity if used
  if err := s.cache.Ping(r.Context()); err != nil {
    s.logger.Warn("health check: cache ping failed", zap.Error(err))
    w.WriteHeader(http.StatusServiceUnavailable)
    json.NewEncoder(w).Encode(map[string]string{"status": "unhealthy", "reason": "cache"})
    return
  }

  json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}

When a task starts failing health checks, ECS replaces it. The logs from the failing container are still in CloudWatch, queryable by trace ID or time range.

Layer 6: CloudWatch Container Insights

For resource-level debugging (CPU spike, memory growth, task restarts), enable Container Insights on your ECS cluster:

aws ecs update-cluster-settings \
  --cluster production \
  --settings name=containerInsights,value=enabled

This gives you pre-built dashboards showing:

  • Task count over time
  • CPU and memory utilization per service
  • Network I/O
  • Container restart count

When you see a memory spike at 2am, Container Insights shows you which task and which container, and you can correlate the timestamp with your structured logs to find the request that caused it.

Debugging workflow in practice

When something goes wrong in production, the workflow is:

  1. Check CloudWatch Container Insights for resource anomalies and restart counts
  2. Search CloudWatch Logs for level = "error" in the relevant time window
  3. Pick a failing trace ID from those errors
  4. Search all log groups for that trace ID to reconstruct the full request flow
  5. If you need live container state, use ECS Exec
  6. If the container has already restarted, the logs are still there in CloudWatch

For Lebanese and MENA SaaS products that cannot afford a dedicated SRE team, this setup provides enough visibility to diagnose most production issues without specialized tooling or 24/7 on-call engineering.


Lessons from production

Build observability before you need it. Structured logs, correlation IDs, and panic recovery are not optional in any production Go service on ECS. ECS Exec is a useful escape hatch, but if you are relying on it regularly, your logging is insufficient. Container Insights is cheap and worth enabling on every cluster.


Need help setting up production-grade observability for your Go services?

Voxire builds and maintains cloud infrastructure for SaaS products across Lebanon and the MENA region. If your team is flying blind in production, we can help you build the visibility layer you actually need.

https://voxire.com/get-a-quote/
