Zero-Downtime PostgreSQL Schema Migrations in a Live SaaS Product

The first time a team runs ALTER TABLE on a production PostgreSQL table with two million rows during business hours, it locks the table, blocks every incoming request, and produces a very educational incident. This is how to avoid repeating that lesson on a live SaaS product in Lebanon or across the MENA region.

The first time a team runs ALTER TABLE on a production PostgreSQL table with two million rows during business hours, it locks the table, blocks every incoming request, and produces a very educational incident. This is how to avoid repeating that lesson on a live SaaS product.

What actually happens when ALTER TABLE runs on a large table

PostgreSQL's behavior during schema changes depends on the operation. Some changes are cheap: adding a nullable column with no default, for example, is a metadata-only change in recent PostgreSQL versions and completes in milliseconds regardless of table size.

Other changes are expensive. Adding a column with a non-null default, adding a constraint like NOT NULL or a CHECK constraint without a default, creating an index without CONCURRENTLY, or rewriting a column's type all require PostgreSQL to scan the full table. During this scan, PostgreSQL holds a lock that blocks concurrent reads and writes.

For a two-million-row table, a migration that requires a full table rewrite can take 30 seconds to several minutes depending on hardware and data distribution. Every API request that touches that table during the migration either waits behind the lock or receives a timeout error. On a SaaS product with Lebanese and Gulf business customers, this is a customer-facing outage.

The lock acquisition itself is the first hazard. Before PostgreSQL can take the lock, it must wait for all current transactions on that table to complete. If a long-running query is in progress when the migration starts, the ALTER TABLE waits. While it waits, subsequent queries queue behind it. A migration that acquires a lock slowly can cause a queue buildup that outlasts the migration itself.

The expand-contract pattern: the foundation of safe migrations

The expand-contract pattern breaks a single schema change into multiple backward-compatible steps that can be deployed independently.

The three phases:

Expand: Add new schema elements without removing old ones. The old code still works. New code can use the new elements.

Migrate data: Backfill the new columns or tables with data from the old structure. Both old and new code continue to work during the backfill.

Contract: Remove the old schema elements once all application code has been updated and deployed to use the new structure.

A concrete example: renaming a column user_name to display_name.

With a naive approach: ALTER TABLE users RENAME COLUMN user_name TO display_name. This breaks every running application instance that still references user_name. Even with a coordinated deployment, there is a window where the old code is still running and the column no longer exists.

With expand-contract:

Migration 1: Add display_name column. Copy data from user_name to display_name via a backfill query or a trigger that keeps them in sync during the transition.
Deploy: Update application code to read and write display_name. Old code still works because user_name still exists.
Migration 2: Drop display_name trigger if added. Drop user_name column. New code never references it.

The key invariant: at every step, the database schema and the running application code are compatible. There is no deployment window where schema and code are mismatched.

Multi-step migration examples for common changes

Adding a NOT NULL column with a default value:

Naive: ALTER TABLE orders ADD COLUMN processed BOOLEAN NOT NULL DEFAULT FALSE.

In older PostgreSQL versions, this rewrites the full table. In PostgreSQL 11 and later, columns with volatile defaults still cause a table rewrite. Immutable defaults with DEFAULT false are fast in PostgreSQL 11+, but NOT NULL enforcement still requires a constraint scan.

Safe approach for large tables:

-- Migration 1: add nullable, no default required, instant
ALTER TABLE orders ADD COLUMN processed BOOLEAN;

-- Migration 2 (separate deploy cycle): backfill in batches
UPDATE orders SET processed = FALSE WHERE processed IS NULL AND id BETWEEN $start AND $end;

-- Migration 3: add NOT NULL constraint only after backfill is complete
ALTER TABLE orders ALTER COLUMN processed SET NOT NULL;

The batch backfill is critical. A single UPDATE orders SET processed = FALSE WHERE processed IS NULL on a table with ten million rows holds a row-level lock on every row it touches. Batching by primary key range keeps transactions short.

Creating an index on a large table:

Standard CREATE INDEX holds a lock that blocks writes for the duration. Use CREATE INDEX CONCURRENTLY instead:

CREATE INDEX CONCURRENTLY idx_orders_tenant_id_created_at
ON orders(tenant_id, created_at DESC);

CONCURRENTLY builds the index while allowing concurrent reads and writes. It takes longer and cannot run inside a transaction block, but it never locks the table for writes.

Note: if a CREATE INDEX CONCURRENTLY operation fails, it leaves an invalid index in the schema. Clean it up with DROP INDEX CONCURRENTLY before retrying.

Setting up golang-migrate for production

golang-migrate manages versioned migration files and tracks which migrations have been applied to which database. For a Go SaaS backend, it integrates cleanly with the startup sequence or as a standalone migration binary.

Directory structure:

internal/db/migrations/
  000001_create_users.up.sql
  000001_create_users.down.sql
  000002_add_tenant_id.up.sql
  000002_add_tenant_id.down.sql
  000003_add_display_name_column.up.sql
  000003_add_display_name_column.down.sql

Each migration has an up and a down file. Down migrations allow rollback. In practice, reversing a migration on a production database is rarely the right move, but having the down migration written forces you to think about what rollback actually means for each change.

Running migrations at service startup:

func runMigrations(db *sql.DB) error {
    m, err := migrate.NewWithDatabaseInstance(
        "file://internal/db/migrations",
        "postgres",
        driver,
    )
    if err != nil {
        return fmt.Errorf("migration init: %w", err)
    }
    if err := m.Up(); err != nil && err != migrate.ErrNoChange {
        return fmt.Errorf("migration run: %w", err)
    }
    return nil
}

Running migrations at startup works for small tables in development. For production migrations that may take minutes, running at startup blocks service initialization. The correct approach for production is running migrations as a separate pre-deploy job, before the new service version is deployed.

Coordinating migrations with service deployments

The deployment sequence for a safe migration:

Run the database migration as a pre-deploy step before the new service version is deployed
Verify the migration completed successfully
Deploy the new service version
If migration and deployment are both expand-phase changes, both can proceed in parallel
Contract-phase migrations (dropping old columns) must run after all service instances running old code have been replaced

This sequence requires that the pre-deploy migration is backward-compatible with the currently running service version. If the migration adds a column that the old code ignores, the old code continues running safely. If the migration removes a column that the old code reads, you have a problem regardless of sequencing.

For multi-region SaaS products serving both Lebanon and Gulf markets from the same database, coordinating migration timing with deployment rollout windows requires explicit planning. A migration that runs at midnight Beirut time is 1am in Saudi Arabia and 2am in the UAE. For B2B SaaS products with Gulf enterprise customers, confirm that the migration window is outside their business hours, not just Lebanon's.

When to use pg_repack for large table changes

pg_repack is a PostgreSQL extension that rewrites tables in the background without holding a full table lock. It is the right tool when you need to:

Add a NOT NULL column with a computed default on a very large table
Remove dead rows from a table that has been heavily updated (table bloat)
Reorder rows to match a clustered index layout

pg_repack creates a new version of the table in the background, applies changes in parallel with live traffic, and performs a brief lockout only for the final cutover. The lockout is typically seconds rather than minutes.

The operational overhead of pg_repack is higher than standard migrations. It requires the extension installed on the database server, additional disk space during the rewrite (the table size approximately doubles temporarily), and monitoring during the operation.

For SaaS products running on AWS RDS or Aurora PostgreSQL, pg_repack must be installed as an RDS extension. It is available in the RDS extension list but must be explicitly enabled.

Rollback strategies that actually work

The expand-contract pattern is itself the rollback strategy for most migrations. Because the database schema and application code are always compatible during expand-phase migrations, a failed deployment can be rolled back to the previous code version without touching the database.

Contract-phase rollbacks (restoring a dropped column) are the genuinely hard case. If you drop a column and then need to roll back, the data is gone unless you have a backup. The correct safeguard is to never perform contract-phase cleanup migrations until you are confident the expand phase has been stable in production for an adequate observation period (typically two to four deployment cycles).

For point-in-time recovery, AWS RDS and Aurora both support PITR with up to 35 days of retention. For Lebanese and MENA SaaS products where PITR may be the final backstop for a bad migration, verify the recovery time for your database size before you need it, not during an incident.

Key lessons from production

The expand-contract pattern eliminates the majority of zero-downtime migration problems. It requires more migrations per schema change, but each migration is safe to run without coordination with application deployments.

CREATE INDEX CONCURRENTLY is non-negotiable for tables with more than a few hundred thousand rows. The performance difference between concurrent and blocking index creation is irrelevant compared to the difference in impact on production traffic.

Batch backfill queries by primary key range. A single UPDATE on millions of rows is a production event. Batching turns it into background work.

Contract-phase cleanups (dropping old columns and tables) should be delayed until the new code has been stable in production across multiple deployment cycles. There is no urgency to remove the old structure.

What actually happens when ALTER TABLE runs on a large table

The expand-contract pattern: the foundation of safe migrations

Multi-step migration examples for common changes

Setting up golang-migrate for production

Coordinating migrations with service deployments

When to use pg_repack for large table changes

Rollback strategies that actually work

Key lessons from production

Not sure where to start?

Keep reading

Feature Flags in Production Go SaaS: Deploying Features Without Restarts

Production Audit Logs in Go SaaS: Design, Storage, and What Most Teams Get Wrong

Go API Testing in Production SaaS: The Patterns That Actually Catch Real Bugs