GitHub Actions to ECS: Building a Zero-Downtime CI/CD Pipeline for Go Services

Manual deployments are where bugs slip through and where teams lose confidence in their release process. Every manual step is an opportunity for human error: a skipped test, a wrong environment variable, or a deployment to the wrong cluster. For SaaS teams in Lebanon and the MENA region running backend services on AWS ECS, a properly built CI/CD pipeline is not a nice-to-have; it is a prerequisite for shipping with confidence.

This is how we build GitHub Actions pipelines that take Go services from a git push to a live rolling ECS deployment with zero manual steps and automatic rollback on failure.

What the pipeline needs to do

A complete pipeline for a Go service on ECS covers these stages in sequence:

1. Build and test the Go service
2. Run the dependency check (ensure no imports are missing from go.mod)
3. Build a minimal Docker image
4. Push the image to Amazon ECR
5. Update the ECS task definition with the new image
6. Trigger a rolling deployment on the ECS service
7. Monitor deployment health and roll back automatically on failure

Stage seven is what most basic pipeline tutorials skip, and it is the stage that matters most for production SaaS operations.

Repository structure for a Go service

A well-organized Go service for ECS deployment has a predictable layout:

/
├── .github/
│   └── workflows/
│       ├── test.yml        # runs on every PR
│       └── deploy.yml      # runs on push to main
├── cmd/
│   └── server/
│       └── main.go
├── internal/
├── Dockerfile
├── task-definition.json    # ECS task definition template
└── go.mod

The task-definition.json is a template checked into the repository. The pipeline fills in the image tag at deploy time.

The Dockerfile: building a minimal Go image

Small Docker images build faster, push faster, pull faster on ECS instances, and have a smaller attack surface. The multi-stage build is the standard approach for Go:

FROM golang:1.23-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags='-w -s' -o server ./cmd/server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
COPY --from=builder /etc/ssl/certs /etc/ssl/certs
USER nonroot:nonroot
ENTRYPOINT ["/server"]

The distroless/static-debian12 base image ships no shell, no package manager, and no extra libraries, so the final image is essentially the Go binary plus TLS certificates and a few support files. The resulting image is typically 10 to 20 MB, compared to 100 to 200 MB for an ubuntu-based image. On ECS, smaller images mean faster task replacement during deployments.

The -ldflags='-w -s' flags omit the DWARF debug information (-w) and the symbol table (-s), which typically reduces binary size by 20 to 30% without affecting runtime behavior.

The test workflow: runs on every pull request

name: Test
on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
          cache: true

      - name: Run tests
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/testdb?sslmode=disable
        run: go test ./... -race -timeout 120s

      - name: Verify go.mod is tidy
        run: |
          go mod tidy
          git diff --exit-code go.mod go.sum

The embedded PostgreSQL service container runs the real database during tests. Testing against a real PostgreSQL instance rather than mocks catches SQL compatibility issues, constraint violations, and transaction behavior that SQLite or in-memory databases miss.

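-race enables Go's race detector. Race conditions in Go services show up under concurrent load, not in single-threaded unit tests, and the detector only reports races on code paths the tests actually exercise concurrently. Catching them in CI before they reach production is worth the slowdown, which the Go documentation puts at roughly 2x to 20x in execution time and 5x to 10x in memory.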
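As an illustration of what the detector looks for, here is a small self-contained sketch. The Counter type and the hammer helper are hypothetical names for this example. The version shown is the corrected one: if the mutex were removed, the concurrent c.n++ would be a data race that go test -race reports when a test drives it from several goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is safe for concurrent use. Without the mutex, the read-modify-write
// in Inc would be a data race.
type Counter struct {
	mu sync.Mutex
	n  int
}

// Inc adds one to the counter while holding the lock.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// Value returns the current count while holding the lock.
func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// hammer calls Inc from `workers` goroutines, `perWorker` times each, and
// returns the final count once every goroutine has finished.
func hammer(workers, perWorker int) int {
	var c Counter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
	return c.Value()
}

func main() {
	fmt.Println(hammer(50, 1000)) // prints 50000
}
```

A test that only calls Inc from a single goroutine would pass with or without the mutex, which is why the concurrent driver matters as much as the -race flag.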

The deployment workflow

name: Deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

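jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    # Same PostgreSQL service container as test.yml, so the test step below
    # has a database to connect to.
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/github-deploy
          aws-region: ${{ vars.AWS_REGION }}

      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
          cache: true

      - name: Run tests
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/testdb?sslmode=disable
        run: go test ./... -race -timeout 120s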

      - name: Build and push image
        id: build
        env:
          REGISTRY: ${{ steps.ecr-login.outputs.registry }}
          REPO: ${{ vars.ECR_REPOSITORY }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $REGISTRY/$REPO:$IMAGE_TAG -t $REGISTRY/$REPO:latest .
          docker push $REGISTRY/$REPO:$IMAGE_TAG
          docker push $REGISTRY/$REPO:latest
          echo "image=$REGISTRY/$REPO:$IMAGE_TAG" >> $GITHUB_OUTPUT

      - name: Render ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: api
          image: ${{ steps.build.outputs.image }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ vars.ECS_SERVICE }}
          cluster: ${{ vars.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10
          force-new-deployment: true

AWS authentication: OIDC instead of long-lived keys

The configuration above uses role-to-assume rather than AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. This is the OIDC (OpenID Connect) approach where GitHub Actions assumes an IAM role directly without storing long-lived credentials in GitHub Secrets.

The benefit is significant: there are no long-lived keys to rotate, no risk of a key leaking from a repository, and the temporary credentials issued for the assumed role expire on their own shortly after the workflow run (after about an hour by default).

To configure this, create an IAM OIDC identity provider in AWS for GitHub Actions (one-time setup), then create an IAM role that the GitHub Actions workflows for your specific repository can assume, with permissions scoped to ECR push, ECS task definition registration, and ECS service update.

Zero-downtime rolling deployments

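ECS rolling deployments replace tasks in batches while keeping the service available; the batch size is bounded by the service's minimumHealthyPercent and maximumPercent settings. ECS stops and rolls back a failing deployment on its own only when the deployment circuit breaker is enabled with rollback; otherwise it keeps retrying the new tasks.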
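How many tasks ECS may stop or start at a time follows from two service settings, minimumHealthyPercent and maximumPercent, both expressed as a percentage of the desired task count. The sketch below works through that arithmetic; rollingBounds is a hypothetical helper, and it assumes the rounding the ECS documentation describes (the lower bound rounds up, the upper bound rounds down).

```go
package main

import "fmt"

// rollingBounds returns the fewest tasks ECS keeps running and the most tasks
// it allows to run at once during a rolling deployment.
func rollingBounds(desired, minHealthyPercent, maxPercent int) (minRunning, maxRunning int) {
	minRunning = (desired*minHealthyPercent + 99) / 100 // integer ceiling
	maxRunning = desired * maxPercent / 100             // integer floor
	return minRunning, maxRunning
}

func main() {
	lo, hi := rollingBounds(2, 50, 200)
	fmt.Println(lo, hi) // prints: 1 4

	lo, hi = rollingBounds(3, 50, 200)
	fmt.Println(lo, hi) // prints: 2 6
}
```

With a desired count of 2 and settings of 50 and 200, ECS may run up to four tasks during the rollout and never drops below one, so new tasks can start before any old task is stopped.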

For zero-downtime, your ECS service needs:

A properly configured health check. ECS waits for new tasks to pass their health check before terminating old tasks. Your application must respond to GET /health with a 200 within the health check timeout. If the health check endpoint itself requires database access, ensure the database connection is validated on startup.

Minimum healthy percent above zero. Set minimumHealthyPercent to at least 50 in your ECS service configuration. With two tasks, this means ECS will keep at least one task running during the deployment.

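Adequate deployment timeout. The wait-for-minutes: 10 in the GitHub Actions step caps how long the workflow waits for the service to reach a steady state; it is not an ECS-side timeout. If your service does not stabilize within 10 minutes, the step fails and GitHub Actions returns a non-zero exit code, while ECS itself keeps trying to place the new tasks unless the deployment circuit breaker is enabled.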

Automatic rollback on failure

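When wait-for-service-stability: true is set and the deployment does not stabilize, the GitHub Actions step fails. Unless the deployment circuit breaker is enabled with rollback, ECS itself does not revert to the previous task definition, but you can add a rollback step:

      - name: Rollback on failure
        if: failure()
        run: |
          # After a failed rollout, services[0].taskDefinition already points at
          # the new revision, so read the previous one from the ACTIVE deployment.
          PREV_TASK_DEF=$(aws ecs describe-services \
            --cluster "${{ vars.ECS_CLUSTER }}" \
            --services "${{ vars.ECS_SERVICE }}" \
            --query "services[0].deployments[?status=='ACTIVE'] | [0].taskDefinition" \
            --output text)
          # Nothing to roll back if an earlier step failed before the deploy began.
          if [ -z "$PREV_TASK_DEF" ] || [ "$PREV_TASK_DEF" = "None" ]; then
            echo "No previous deployment found; skipping rollback."
            exit 0
          fi
          aws ecs update-service \
            --cluster "${{ vars.ECS_CLUSTER }}" \
            --service "${{ vars.ECS_SERVICE }}" \
            --task-definition "$PREV_TASK_DEF"

This step runs only when a previous step has failed. By that point the service's taskDefinition field already points at the new revision, so the step reads the previous revision from the deployment that is still in ACTIVE status (the one being replaced) and updates the service back to it. If no such deployment exists, for example because the tests failed before the deploy step ran, the step exits without changing anything.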

Environment-specific configuration

Environment variables for the running service should come from AWS Systems Manager Parameter Store or Secrets Manager, not from the task definition file checked into the repository. Secrets in the task definition file would be committed to version control.

In the task definition template:

{
  "secrets": [
    {
      "name": "DATABASE_URL",
      "valueFrom": "arn:aws:ssm:eu-west-1:123456789012:parameter/voxire/prod/database_url"
    }
  ]
}

ECS pulls the secret value from SSM at task launch time. Rotating a database password requires updating the SSM parameter and restarting the tasks, not modifying any code or configuration files.
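On the service side, the injected value arrives as an ordinary environment variable. A minimal sketch of reading it at startup and failing fast when it is missing might look like this; Config and loadConfig are hypothetical names, and the connection string in main is a made-up stand-in for what ECS would inject.

```go
package main

import (
	"errors"
	"fmt"
)

// Config holds settings that ECS injects as environment variables.
type Config struct {
	DatabaseURL string
}

// loadConfig reads required settings through getenv and returns an error when
// one is missing, so a misconfigured task stops at startup instead of failing
// on its first request.
func loadConfig(getenv func(string) string) (Config, error) {
	url := getenv("DATABASE_URL")
	if url == "" {
		return Config{}, errors.New("DATABASE_URL is not set")
	}
	return Config{DatabaseURL: url}, nil
}

func main() {
	// In a real service this would be loadConfig(os.Getenv); a map is used
	// here so the sketch runs without any environment set up.
	env := map[string]string{"DATABASE_URL": "postgres://user:pass@db.internal:5432/app"}
	cfg, err := loadConfig(func(key string) string { return env[key] })
	if err != nil {
		panic(err)
	}
	fmt.Println("config loaded:", cfg.DatabaseURL != "")
}
```

Passing getenv as a parameter keeps the loader testable without touching the process environment. A task that exits at startup because of a missing secret also fails its rollout early, which is what the stability wait in the deploy workflow is there to catch.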

Key lessons from production

OIDC for AWS authentication eliminates the long-lived credential management problem permanently. The setup takes 20 minutes and the operational benefit is significant, particularly for teams across Lebanon and the MENA region where credential rotation practices are inconsistent.

Running tests in the deployment workflow, not just the PR workflow, catches cases where a dependency in the main branch changed between when a PR was reviewed and when it was merged.

Distroless base images are worth the learning curve. The push and pull time difference over ubuntu-based images is meaningful at the frequency teams deploy.

The rollback step should be in every deployment workflow. Without it, a failed deployment leaves the service in a degraded state that requires manual intervention.
