Building a Distributed Job Scheduler in Go Using PostgreSQL Advisory Locks

Most Go SaaS teams reach for Redis-backed queues or a standalone scheduler such as Temporal when they need background jobs. There is a cheaper option: PostgreSQL advisory locks combined with a lightweight Go scheduler loop. At Voxire, RTYLR uses this pattern in production for scheduled reporting, tenant data sync, and daily cleanup tasks.

Why a Separate Scheduler Adds Operational Cost

Before explaining the implementation, it is worth understanding what you are trading away when you add a dedicated scheduler service.

A separate scheduler process means:

Another deployment artifact to build, containerize, and monitor
A distributed coordination problem: which instance of your scheduler actually runs a given job?
A Redis dependency purely for distributed locking, adding infrastructure cost and a new failure mode
An additional surface area for operational issues at 3am

If your job requirements are modest (tens of jobs, minute-level precision), PostgreSQL already has everything you need.

How PostgreSQL Advisory Locks Work

Advisory locks are application-level locks that PostgreSQL tracks but does not enforce automatically. Your application calls pg_try_advisory_lock(key bigint) and PostgreSQL returns true if the lock was acquired, false if another session already holds it.

Two modes matter here:

Session-level locks: held until the connection closes or you explicitly call pg_advisory_unlock. Survive transaction rollbacks.
Transaction-level locks: released automatically when the transaction ends. Acquired with pg_try_advisory_xact_lock.

For a job scheduler, session-level locks are the right choice. You want the lock held for the duration of the job, not a transaction.

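The key is a bigint. You can use hashtext('job_name')::bigint to derive a stable numeric key from a string job name, or maintain a static constant per job type. Keep in mind that hashtext returns a 32-bit integer, so the cast widens the value without making collisions any less likely, and that it is an undocumented function whose output should not be assumed stable across PostgreSQL major versions.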
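If you would rather derive the key on the Go side, a minimal sketch follows. It assumes 64-bit FNV-1a from the standard library is acceptable; the helper name lockKey is hypothetical, and the values it produces are not the ones hashtext would return, so the two approaches should not be mixed for the same job.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// lockKey is a hypothetical helper: it maps a job name to a stable int64
// suitable for pg_try_advisory_lock(bigint). It uses FNV-1a, not PostgreSQL's
// hashtext, so its values differ from hashtext('job_name').
func lockKey(jobName string) int64 {
	h := fnv.New64a()
	h.Write([]byte(jobName)) // Write on a hash.Hash never returns an error
	return int64(h.Sum64())  // reinterpret the 64 bits as a signed bigint
}

func main() {
	for _, name := range []string{"daily_report", "session_cleanup", "inventory_sync"} {
		fmt.Printf("%-16s %d\n", name, lockKey(name))
	}
}
```

Computing the key once and storing it in the lock_key column keeps it fixed even if the hashing code changes later.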

When multiple Go workers start and each tries to acquire the same advisory lock, exactly one succeeds. The rest skip execution. No external coordination layer required.

The Scheduler Loop in Go

The core loop is a time.Ticker that fires every minute (or whatever your minimum resolution is). On each tick, the worker tries to acquire the advisory lock for each job that is due.

func (s *Scheduler) Run(ctx context.Context) {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            s.runDueJobs(ctx)
        }
    }
}

runDueJobs queries the job_runs table for jobs where next_run <= now(), then attempts to acquire the advisory lock before dispatching each one.

func (s *Scheduler) runDueJobs(ctx context.Context) {
    jobs, err := s.store.DueJobs(ctx)
    if err != nil {
        s.logger.Error("failed to query due jobs", "err", err)
        return
    }

    for _, job := range jobs {
        go s.tryRunJob(ctx, job)
    }
}

Each job runs in its own goroutine. The advisory lock acquisition happens inside tryRunJob before any work begins.

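func (s *Scheduler) tryRunJob(ctx context.Context, job JobRecord) {
    // Resolve the handler first so an unregistered job cannot call a nil func.
    handler, ok := s.handlers[job.Name]
    if !ok {
        s.logger.Error("no handler registered", "job", job.Name)
        return
    }

    conn, err := s.pool.Acquire(ctx)
    if err != nil {
        s.logger.Error("failed to acquire connection", "job", job.Name, "err", err)
        return
    }
    defer conn.Release()

    var acquired bool
    err = conn.QueryRow(ctx,
        "SELECT pg_try_advisory_lock($1)",
        job.LockKey,
    ).Scan(&acquired)
    if err != nil || !acquired {
        return // query failed, or another worker holds the lock
    }
    // The lock is session-level and belongs to this connection. Returning the
    // connection to the pool does not end the session, so unlock explicitly.
    // Deferred calls run last-in first-out: this runs before conn.Release().
    defer func() {
        // Use a fresh context so the unlock still runs after ctx is cancelled.
        unlockCtx, cancelUnlock := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancelUnlock()
        if _, err := conn.Exec(unlockCtx, "SELECT pg_advisory_unlock($1)", job.LockKey); err != nil {
            s.logger.Error("failed to release advisory lock", "job", job.Name, "err", err)
        }
    }()

    jobCtx, cancel := context.WithTimeout(ctx, job.Timeout)
    defer cancel()

    if err := handler(jobCtx); err != nil {
        s.logger.Error("job failed", "job", job.Name, "err", err)
        s.store.MarkFailed(ctx, job.ID, err.Error())
    } else {
        s.store.MarkComplete(ctx, job.ID)
    }
}

The connection holding the advisory lock stays checked out for the full job duration. Returning it to the pool with conn.Release() does not end the PostgreSQL session, so it does not release a session-level lock; the deferred pg_advisory_unlock does that. PostgreSQL releases the lock on its own only when the session ends, for example when a worker crashes and its connection closes.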

Defining Jobs in a Table

The job_runs table stores job definitions and execution history. A minimal schema:

CREATE TABLE job_runs (
    id          BIGSERIAL PRIMARY KEY,
    job_name    TEXT NOT NULL,
    lock_key    BIGINT NOT NULL,
    last_run    TIMESTAMPTZ,
    next_run    TIMESTAMPTZ NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',
    error_msg   TEXT,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON job_runs (next_run) WHERE status = 'pending';

Each row represents one scheduled execution. After a job completes, the scheduler inserts a new row with the updated next_run timestamp based on the job's interval.

For RTYLR, the recurring jobs are:

Tenant daily report generation (per restaurant, per night)
Stale session cleanup (every 6 hours)
External inventory sync (every 30 minutes for connected POS integrations)

Each job type has a handler registered at startup:

s.RegisterHandler("daily_report", reportHandler.Run)
s.RegisterHandler("session_cleanup", sessionHandler.Run)
s.RegisterHandler("inventory_sync", inventoryHandler.Run)

Handling Missed Jobs and Catchup Logic

When a deployment takes 2 minutes or a server restarts, some jobs miss their scheduled window. Naive schedulers skip them. That causes silent data gaps.

The DueJobs query uses a lookback window to catch missed executions:

SELECT * FROM job_runs
WHERE next_run <= now()
  AND next_run >= now() - INTERVAL '1 hour'
  AND status = 'pending'
ORDER BY next_run ASC;

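The 1-hour window is configurable per job type. Jobs with tighter SLAs (inventory sync) have a 10-minute lookback. Nightly reports have a 6-hour window. A pending row that ages out of its window no longer matches this query, so it needs to be marked as skipped and its next run scheduled separately; otherwise that job stops recurring.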

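For jobs that must not run twice (idempotency is not guaranteed), the advisory lock is the safety mechanism against concurrent execution. If two workers both see a missed job in the catchup window and try at the same moment, only one acquires the lock. The lock does not stop a second worker from acquiring it after the first has finished and released it, so the winner should re-check that the row is still pending once it holds the lock, and skip the job if it is not.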

Idempotent jobs (cleanup tasks) can tolerate running twice, so their catchup window is more generous.
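An advisory lock prevents two workers from running a job at the same time, but not one after the other. The sketch below shows the ordering that closes that gap: take the lock, re-check the row's status, then run. It uses an in-memory stand-in for PostgreSQL so it can run anywhere; fakeDB, runOnce, and the status strings are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeDB is an in-memory stand-in for PostgreSQL. tryLock plays the role of
// pg_try_advisory_lock and status plays the role of job_runs.status.
type fakeDB struct {
	mu     sync.Mutex
	locked map[int64]bool   // advisory lock key -> held
	status map[int64]string // job_runs row ID -> status
}

func newFakeDB() *fakeDB {
	return &fakeDB{locked: map[int64]bool{}, status: map[int64]string{}}
}

func (db *fakeDB) tryLock(key int64) bool {
	db.mu.Lock()
	defer db.mu.Unlock()
	if db.locked[key] {
		return false
	}
	db.locked[key] = true
	return true
}

func (db *fakeDB) unlock(key int64) {
	db.mu.Lock()
	defer db.mu.Unlock()
	delete(db.locked, key)
}

func (db *fakeDB) getStatus(id int64) string {
	db.mu.Lock()
	defer db.mu.Unlock()
	return db.status[id]
}

func (db *fakeDB) setStatus(id int64, s string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.status[id] = s
}

// runOnce returns true only if this caller actually executed the job.
func runOnce(db *fakeDB, rowID, lockKey int64, job func()) bool {
	if !db.tryLock(lockKey) {
		return false // another worker is running it right now
	}
	defer db.unlock(lockKey)

	// Re-check under the lock: the row may have been completed between the
	// due-jobs query and the moment the lock was acquired.
	if db.getStatus(rowID) != "pending" {
		return false
	}
	job()
	db.setStatus(rowID, "complete")
	return true
}

func main() {
	db := newFakeDB()
	db.setStatus(1, "pending")
	runs := 0
	// Both workers saw row 1 as due, but reach the lock one after the other.
	first := runOnce(db, 1, 42, func() { runs++ })
	second := runOnce(db, 1, 42, func() { runs++ })
	fmt.Println(first, second, runs)
}
```

Against a real database, the re-check would be a SELECT of the row's status on the connection that holds the lock.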

Graceful Shutdown and Timeout Enforcement

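Each job runs with a context.WithTimeout. This is non-negotiable. Without it, a hung job holds the advisory lock indefinitely and blocks future runs. The timeout is cooperative: it only frees the lock if the handler checks its context and returns.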

const defaultJobTimeout = 5 * time.Minute

jobCtx, cancel := context.WithTimeout(ctx, defaultJobTimeout)
defer cancel()

Per-job timeouts are stored in the job_runs table (in a column not shown in the minimal schema above) and pulled at dispatch time, allowing different timeouts for different jobs without code changes.

For graceful shutdown, the scheduler listens for SIGTERM and SIGINT:

ctx, stop := signal.NotifyContext(context.Background(),
    syscall.SIGTERM,
    syscall.SIGINT,
)
defer stop()

scheduler.Run(ctx)

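When the context is cancelled, the ticker stops and no new jobs are dispatched. In tryRunJob as shown, each job's context derives from the scheduler's context, so cancellation also propagates to running jobs, and Run returns without waiting for them. For in-flight jobs to continue until their own timeout expires or they complete naturally, give them a context detached from the shutdown signal and have the process wait for the dispatched goroutines before it exits.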

Key Lessons from Production

Running this in RTYLR across multiple tenant deployments has surfaced a few lessons worth passing on:

Lock key collisions are subtle bugs. Using hashtext('job_name')::bigint can theoretically collide. In practice we have not seen it, but storing the lock key explicitly in the table removes ambiguity.
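The connection pool matters. Advisory locks are session-scoped, so the lock lives and dies with one specific connection. Returning that connection to the pool does not release the lock, and anything that closes it mid-job (a network drop, a server-side timeout, a failover) releases the lock early. Hold one connection for the whole job or use a dedicated connection for lock-holding, unlock explicitly before handing it back, and avoid routing lock-holding connections through a transaction-mode pooler such as PgBouncer, where session-level locks do not behave reliably. Idle-time settings such as MaxConnIdleTime govern idle connections, so they are not a substitute for these steps.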
Log lock contention. When pg_try_advisory_lock returns false, log it at debug level with the job name and worker ID. During incidents, this trace tells you which worker actually ran the job.
Do not run the scheduler in every web pod. In ECS, designate one task as the scheduler role using an environment variable flag and use a separate ECS service for it. This avoids running duplicate scheduler instances that all compete for locks on every tick.
Catchup logic needs a max window. Without an upper bound on the lookback, a job that was stuck for days can suddenly trigger dozens of missed runs. Always cap the catchup window.

This pattern handles everything RTYLR needs for background processing without Redis, without a separate scheduler binary, and without a separate coordination service. The total implementation is under 400 lines of Go and a single PostgreSQL table.
