Graceful Shutdown in Go Services on ECS: How to Stop Safely Without Dropping Requests

Most Go services on ECS handle SIGTERM wrong. Here is what actually happens during a rolling deploy and how to stop cleanly without dropping in-flight requests or leaving database connections in an undefined state.

Many deployment incidents are not caused by bugs in the new code. They are caused by the old version of your service stopping in the wrong way. A rolling deploy on ECS sends SIGTERM to running containers, waits a configurable number of seconds, then sends SIGKILL. Everything that happens in that window determines whether your users see errors.

This is a problem we run into constantly while building SaaS backends for clients in Lebanon and across the MENA region. The fix is not complicated, but it requires understanding what ECS actually does during a deploy, and then writing your Go shutdown logic to match.

What ECS does when it stops a task

When ECS decides to stop a task, whether for a rolling deploy, a service update, or a scale-in event, it sends SIGTERM to PID 1 inside the container. If your Go binary is not PID 1, for example because the image starts it through a shell-form ENTRYPOINT, the signal may never reach it. ECS then waits for the container to exit on its own. If the container is still running after the stopTimeout value (default 30 seconds, configurable up to 120), ECS sends SIGKILL.

SIGKILL cannot be caught. It terminates the process immediately. Any in-flight HTTP requests die without a response. Any database transactions in progress are abandoned. Any queued jobs that were mid-execution are lost unless you have a recovery mechanism.

The goal of graceful shutdown is to catch SIGTERM in your application and use those 30 seconds well.

The baseline Go signal handler

Every Go HTTP service should have this skeleton:

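package main

import (
    "context"
    "errors"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    router := http.NewServeMux() // placeholder: register your handlers here

    srv := &http.Server{
        Addr:    ":8080",
        Handler: router,
    }

    // Buffered so a signal that arrives before we start receiving is not lost.
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        // ListenAndServe returns http.ErrServerClosed once Shutdown is called.
        if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
            log.Fatalf("server error: %v", err)
        }
    }()

    <-quit
    log.Println("shutdown signal received")

    // 25 seconds leaves a margin inside the default 30-second ECS window.
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()

    if err := srv.Shutdown(ctx); err != nil {
        log.Fatalf("shutdown failed: %v", err)
    }
    log.Println("server stopped cleanly")
}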

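http.Server.Shutdown closes the listeners so no new connections are accepted, closes idle connections, and waits for active connections to finish their current request. It does not wait for hijacked connections such as WebSockets; those need their own shutdown handling. The 25-second timeout gives in-flight requests time to complete while staying inside the ECS 30-second window.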

This baseline works for simple HTTP servers. But most production SaaS services are more complex than that.

What breaks in real production services

Background workers running in goroutines. If you have goroutines processing queue items, running cron-like jobs, or flushing buffers, srv.Shutdown() does not know about them. They keep running after the HTTP server closes, and then SIGKILL hits them mid-execution.

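Database connection pools. sql.DB from the standard library does not close automatically when the HTTP server shuts down. If you close the pool before requests finish, handlers trying to acquire connections will fail. If you never close it, the connections are dropped abruptly when the process exits. The database usually notices right away and logs unexpected client disconnects, but if the host or network disappears first, it can hold those connections until a TCP timeout fires.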

Open file handles and buffers. Logging systems that write to files or stream to external collectors buffer data. An abrupt termination loses the last batch of log lines, which are often the most useful ones: the lines that explain what the service was doing at the moment it died.

Third-party client connections. Redis, Elasticsearch, S3 clients all maintain connection pools. A hard kill in the middle of a multi-step write to Redis, such as a pipeline of related commands, can leave keys in an inconsistent state because only part of the sequence was applied.

Building a shutdown coordinator

The pattern we use in Go SaaS backends is a shutdown coordinator that orchestrates all subsystem shutdowns in the correct order:

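// ShutdownCoordinator tells subsystems to stop and waits for them to finish.
type ShutdownCoordinator struct {
    once sync.Once      // makes Shutdown safe to call more than once
    done chan struct{}  // closed when shutdown begins
    wg   sync.WaitGroup // counts registered subsystems that are still running
}

func NewShutdownCoordinator() *ShutdownCoordinator {
    return &ShutdownCoordinator{done: make(chan struct{})}
}

// Done returns a channel that is closed when shutdown begins.
func (sc *ShutdownCoordinator) Done() <-chan struct{} {
    return sc.done
}

// Register counts one subsystem and returns the function that subsystem
// must call once it has stopped.
func (sc *ShutdownCoordinator) Register() func() {
    sc.wg.Add(1)
    return sc.wg.Done
}

// Shutdown signals every subsystem and waits up to timeout for them to stop.
func (sc *ShutdownCoordinator) Shutdown(timeout time.Duration) {
    // Closing a channel twice panics, so guard the close with sync.Once.
    sc.once.Do(func() { close(sc.done) })

    stopped := make(chan struct{})
    go func() {
        sc.wg.Wait()
        close(stopped)
    }()

    select {
    case <-stopped:
        log.Println("all subsystems stopped")
    case <-time.After(timeout):
        log.Println("shutdown timeout reached")
    }
}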

Background workers receive the Done() channel and finish their current iteration when it closes:

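// runWorker processes batches until shutdown is signalled. Start it with
// "go runWorker(sc, db)" during startup, well before Shutdown can be called,
// so that Register always runs before the coordinator starts waiting.
func runWorker(sc *ShutdownCoordinator, db *sql.DB) {
    done := sc.Register()
    defer done()

    for {
        select {
        case <-sc.Done():
            return // shutdown requested; the previous batch has already finished
        default:
            // processNextBatch should block or back off when there is no
            // work, otherwise this loop spins.
            processNextBatch(db)
        }
    }
}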

Shutdown sequence in main:

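<-quit

// Worst case is 10s + 15s = 25s, which stays inside a 30-second stopTimeout.

// Step 1: stop accepting new HTTP requests and drain in-flight ones
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil {
    log.Printf("http shutdown: %v", err)
}

// Step 2: signal background workers to finish their current batch
sc.Shutdown(15 * time.Second)

// Step 3: close the database pool
if err := db.Close(); err != nil {
    log.Printf("db close: %v", err)
}

// Step 4: flush logs
logger.Flush()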

The order matters. You stop the HTTP layer first so no new work enters the system. You wait for background workers to finish their current unit of work. Then you close external connections. Then you flush IO.

Configuring ECS for graceful shutdown

In your ECS task definition, set stopTimeout explicitly. The default is 30 seconds, which is often not enough for services that process heavy jobs. You can go up to 120 seconds:

{
  "containerDefinitions": [
    {
      "name": "api",
      "stopTimeout": 60
    }
  ]
}

If the service sits behind an Application Load Balancer, also configure the target group's deregistration delay. ECS deregisters the target before it sends SIGTERM, and the load balancer keeps the target in a draining state for the length of the delay so that in-flight requests can finish. The default is 300 seconds, which slows every deploy. Set deregistration_delay.timeout_seconds to match your expected longest request duration, typically 15 to 30 seconds for API endpoints.

With both values set correctly, the timeline looks like this:

  1. ECS decides to stop the task (deploy, scale-in, or manual stop)
  2. Container is deregistered from the load balancer
  3. ALB drains existing connections over the deregistration delay window
  4. ECS sends SIGTERM to the container
  5. Go application catches SIGTERM and begins shutdown sequence
  6. All in-flight requests complete, workers finish their batch
  7. Database pool closes
  8. Container exits cleanly

Health checks and readiness

An HTTP /healthz endpoint that returns 200 when the service is ready is the most common pattern, but it often misses one important behavior: the endpoint should return a non-200 response during shutdown. This tells anything that health-checks the instance to stop routing traffic to it as soon as shutdown begins.

var shuttingDown atomic.Bool

func healthHandler(w http.ResponseWriter, r *http.Request) {
    if shuttingDown.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

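Set shuttingDown to true as the first action when you receive SIGTERM, before you call srv.Shutdown(). Because Shutdown closes the listener immediately, keep serving for a few seconds after flipping the flag if you want health checkers to actually observe the 503. On ECS the ALB target is normally already draining by the time SIGTERM arrives, so treat the failing health check as a safety net for anything else that routes to the instance rather than as the primary drain mechanism.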

Testing shutdown behavior

This is the part most teams skip. Shutdown logic only gets exercised during deploys or failures. A broken shutdown silently drops requests in production, and nobody investigates until the error rate becomes obvious.

We test it like this in local environments:

  1. Start the service
  2. Send a stream of requests that take 3 to 5 seconds each
  3. Send SIGTERM while requests are in-flight
  4. Verify that all in-flight requests complete with correct responses
  5. Verify that the process exits within the expected window
  6. Verify that no database connections are leaked (check with pg_stat_activity)

A simple load test with hey or vegeta running while you kill the process is usually enough to expose problems.

Where things go wrong in practice

The most common issue we see in codebases that come to us for infrastructure review is that signal handling was added as an afterthought. The channel is wired, srv.Shutdown() is called, but nobody thought about the goroutine that runs database migrations on startup and occasionally gets stuck, or the Redis subscriber goroutine that blocks indefinitely waiting for a message.

A second common issue is the ECS stopTimeout being left at the default 30 seconds while the application's shutdown timeout is also 30 seconds. This means any delay anywhere sends SIGKILL before the application has finished. Always set your application timeout to 5 to 10 seconds less than stopTimeout.

A third issue is log flushing. Structured loggers that batch writes to CloudWatch, Datadog, or a log aggregator will lose the last batch when the process dies. Flushing explicitly at shutdown is two lines of code that save hours of debugging.


Key lessons from production

Stop accepting traffic before stopping work. Signal workers with a channel, not a kill signal. Set ECS stopTimeout explicitly to give yourself enough runway. Test shutdown with load running, not in isolation. Flush logs and close external connections in the right order.

A properly shut-down service is invisible: deploys happen without error spikes, no requests drop, no database connections linger. That invisibility is the goal.


Not sure where to start?

Voxire builds and operates Go backends on ECS for SaaS products across Lebanon and the MENA region. If your deployments are causing error spikes or you want to audit your service's shutdown behavior before it becomes a production incident, reach out.

https://voxire.com/get-a-quote/
