
The Transactional Outbox Pattern in Go: Guaranteed Event Publishing with PostgreSQL


The most common reliability bug in event-driven Go backends is the gap between database writes and message publishing. You write to PostgreSQL, then publish to a queue, but those are two separate operations. If the service crashes between them, the event is lost. The consumer never hears about it. Depending on what that event triggers, the consequences range from stale caches to missing notifications to unpaid invoices.

The transactional outbox pattern eliminates that gap by making event publishing part of the same database transaction as the state change. This is how RTYLR, Voxire's commerce platform running across Lebanon and the Gulf, guarantees that payment events, inventory changes, and order state transitions are never silently dropped.

What the problem actually looks like in production

The naive implementation looks like this in nearly every Go codebase:

func (s *OrderService) ConfirmOrder(ctx context.Context, orderID uuid.UUID) error {
    if err := s.db.UpdateOrderStatus(ctx, orderID, "confirmed"); err != nil {
        return err
    }
    // If the process dies here, the event is lost
    return s.queue.Publish(ctx, OrderConfirmedEvent{OrderID: orderID})
}

This code has a consistency window. The database write succeeds. Then the service restarts before Publish runs. The order is confirmed in PostgreSQL, but the downstream systems that subscribe to order.confirmed never receive the event. They continue operating as if the order is still pending.

At low volume, this is invisible. At scale, or during any deployment or infrastructure hiccup, it produces silent inconsistencies that take hours to diagnose because nothing obviously failed.

How the outbox pattern works

The key insight is that PostgreSQL is a transactional database. If two writes happen in the same transaction, they either both succeed or both fail. The outbox pattern takes advantage of this by writing the event into the same database transaction that modifies application state.

You add an outbox table:

CREATE TABLE outbox_events (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  event_type   TEXT NOT NULL,
  payload      JSONB NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  processed_at TIMESTAMPTZ,
  attempts     INT NOT NULL DEFAULT 0
);

CREATE INDEX outbox_unprocessed_idx
  ON outbox_events (created_at)
  WHERE processed_at IS NULL;

Your service writes to both tables in a single transaction:

func (s *OrderService) ConfirmOrder(ctx context.Context, orderID uuid.UUID) error {
    return s.db.InTransaction(ctx, func(tx *sql.Tx) error {
        if err := updateOrderStatus(ctx, tx, orderID, "confirmed"); err != nil {
            return err
        }
        payload, err := json.Marshal(OrderConfirmedPayload{OrderID: orderID})
        if err != nil {
            // Returning the error rolls the status update back as well.
            return err
        }
        _, err = tx.ExecContext(ctx, `
            INSERT INTO outbox_events (event_type, payload)
            VALUES ($1, $2)
        `, "order.confirmed", payload)
        return err
    })
}

Now there is no consistency window. Either both the status update and the outbox event commit, or neither does. The event can no longer be lost between the two writes, because there is no moment at which one has committed and the other has not.

The outbox relay worker

Writing to the outbox is only half of it. You also need a relay process that reads unprocessed events and publishes them to your actual message queue or webhook system.

func (w *OutboxWorker) Run(ctx context.Context) error {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if err := w.processOneBatch(ctx); err != nil {
                log.Printf("outbox relay error: %v", err)
            }
        }
    }
}

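// outboxEvent is one row claimed from outbox_events.
type outboxEvent struct {
    id        uuid.UUID
    eventType string
    payload   json.RawMessage
}

func (w *OutboxWorker) processOneBatch(ctx context.Context) error {
    // Row locks taken by FOR UPDATE SKIP LOCKED last only as long as the
    // transaction, so the select and the updates must share one.
    tx, err := w.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op once Commit has succeeded

    rows, err := tx.QueryContext(ctx, `
        SELECT id, event_type, payload
        FROM outbox_events
        WHERE processed_at IS NULL
        ORDER BY created_at
        LIMIT 50
        FOR UPDATE SKIP LOCKED
    `)
    if err != nil {
        return err
    }
    // Read the whole batch first: the transaction's connection cannot run
    // the updates while the result set is still open.
    var batch []outboxEvent
    for rows.Next() {
        var e outboxEvent
        if err := rows.Scan(&e.id, &e.eventType, &e.payload); err != nil {
            rows.Close()
            return err
        }
        batch = append(batch, e)
    }
    rows.Close()
    if err := rows.Err(); err != nil {
        return err
    }

    for _, e := range batch {
        query := `UPDATE outbox_events SET processed_at = NOW() WHERE id = $1`
        if err := w.publisher.Publish(ctx, e.eventType, e.payload); err != nil {
            // Leave the row unprocessed so a later batch retries it.
            query = `UPDATE outbox_events SET attempts = attempts + 1 WHERE id = $1`
        }
        if _, err := tx.ExecContext(ctx, query, e.id); err != nil {
            return err
        }
    }
    return tx.Commit()
}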

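The FOR UPDATE SKIP LOCKED clause is critical when running multiple relay workers. Without row locks, two workers can select the same event and publish it twice. FOR UPDATE locks the selected rows for the worker that claimed them, and SKIP LOCKED makes the other workers skip those rows and move on to unclaimed ones rather than waiting. The locks last only until the transaction that took them ends, so the select, the publish, and the update that sets processed_at have to run inside a single transaction.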

Handling downstream failures and retries

The relay worker above increments attempts when publishing fails but does not give up. That is intentional for transient failures. The downstream queue might be temporarily unavailable, and you want automatic recovery.

You need a failure policy for persistent failures. Add a maximum attempt count:

ALTER TABLE outbox_events ADD COLUMN failed_at TIMESTAMPTZ;

In the relay worker, after a configurable maximum (say, 10 attempts), set failed_at and stop retrying. Failed events should trigger an alert and manual review. They represent cases where the downstream system rejected the event permanently, which usually indicates a schema mismatch or a bug that needs investigation.
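One way to express that policy is a small pure function the relay consults after a failed publish. This is a minimal sketch: `maxAttempts`, `statementAfterFailure`, and the two SQL constants are illustrative names, not part of the worker shown earlier.

```go
package main

// maxAttempts is an assumed limit; the text above suggests something like 10.
const maxAttempts = 10

// Hypothetical statements the relay would run after a failed publish.
const (
	sqlRetryLater = `UPDATE outbox_events SET attempts = attempts + 1 WHERE id = $1`
	sqlGiveUp     = `UPDATE outbox_events SET attempts = attempts + 1, failed_at = NOW() WHERE id = $1`
)

// statementAfterFailure picks the update to run once a publish has failed.
// attemptsSoFar is the value of the attempts column before this failure.
func statementAfterFailure(attemptsSoFar int) string {
	if attemptsSoFar+1 >= maxAttempts {
		return sqlGiveUp
	}
	return sqlRetryLater
}
```

For this to stop the retries, the relay's SELECT would also need to read `attempts` and filter on `failed_at IS NULL`.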

For events that are safe to discard after they age out (analytics events, for example), add a discard_after column and skip events past that threshold. Order state change events and payment events are never discardable.

Exactly-once versus at-least-once delivery

The outbox pattern gives you at-least-once delivery: events are guaranteed to be published, but may be published more than once in failure scenarios. If the relay worker publishes successfully but crashes before marking the event as processed, it will publish again on the next run.

Downstream consumers must be idempotent. For order events, this means checking whether an order is already in the target state before applying a state transition. For inventory updates, it means using conditional updates guarded by a version number rather than blind increments and decrements, which would be applied twice if the event were delivered twice.
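Both checks can be sketched with in-memory stand-ins for the rows a consumer would update. The `Order` and `Stock` types and the function names here are assumptions made for the example; in a real consumer the same conditions would sit in the WHERE clause of an UPDATE.

```go
package main

// Order and Stock are in-memory stand-ins for database rows.
type Order struct{ Status string }

type Stock struct {
	Quantity int
	Version  int
}

// applyConfirmed is idempotent: a redelivered order.confirmed event finds the
// order already confirmed and changes nothing. It reports whether it applied.
func applyConfirmed(o *Order) bool {
	if o.Status == "confirmed" {
		return false
	}
	o.Status = "confirmed"
	return true
}

// applyStockUpdate mirrors
//   UPDATE stock SET quantity = $1, version = version + 1 WHERE version = $2
// The write lands only if the row is still at the version the event was
// computed against, so a duplicate delivery becomes a no-op.
func applyStockUpdate(s *Stock, newQuantity, expectedVersion int) bool {
	if s.Version != expectedVersion {
		return false
	}
	s.Quantity = newQuantity
	s.Version++
	return true
}
```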

The relay cannot provide exactly-once delivery on its own. What production systems usually do is pair the outbox with deduplication at the message queue or in the consumer (deduplication keys, message IDs), so that each event takes effect once even if it is delivered more than once. In RTYLR, every published event carries the outbox event UUID as a message ID. Consumers check a seen-events table before processing.
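The seen-events check can be sketched with a map standing in for the table. This is not RTYLR's implementation, only an illustration of the idea; `SeenEvents` and `Handle` are hypothetical names. With PostgreSQL the equivalent would typically be an `INSERT ... ON CONFLICT DO NOTHING` into the seen-events table inside the same transaction as the consumer's own writes.

```go
package main

import "sync"

// SeenEvents is an in-memory stand-in for a seen-events table keyed by the
// outbox event UUID carried as the message ID.
type SeenEvents struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewSeenEvents() *SeenEvents {
	return &SeenEvents{seen: make(map[string]struct{})}
}

// Handle runs process only for message IDs that have not been handled yet.
// The ID is recorded only after process succeeds, so a failed attempt is
// retried on redelivery instead of being silently skipped.
func (s *SeenEvents) Handle(messageID string, process func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.seen[messageID]; dup {
		return nil // duplicate delivery: already processed
	}
	if err := process(); err != nil {
		return err
	}
	s.seen[messageID] = struct{}{}
	return nil
}
```

Holding the mutex while `process` runs keeps the sketch short but serializes all handling; a database-backed version gets the same guarantee from the table's primary key.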

PostgreSQL listen/notify as an alternative relay trigger

Polling the outbox table every 500ms is simple and works well for most workloads. When you need lower latency, you can wake the relay with PostgreSQL's LISTEN/NOTIFY mechanism instead of waiting for the next poll:

CREATE OR REPLACE FUNCTION notify_outbox() RETURNS TRIGGER AS $$
BEGIN
  PERFORM pg_notify('outbox_event', NEW.id::TEXT);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_notify_trigger
  AFTER INSERT ON outbox_events
  FOR EACH ROW EXECUTE FUNCTION notify_outbox();

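The relay worker runs LISTEN outbox_event and wakes up as soon as the inserting transaction commits, rather than waiting for the next polling interval. This cuts average event latency from roughly 250ms (half the polling interval) to near zero, and an idle relay no longer has to query the table twice a second. Notifications are not stored for listeners that are disconnected when they are sent, so keep a slow fallback poll to pick up anything that was missed.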

Key lessons from production

Running the outbox pattern in RTYLR production across multiple countries revealed three things that write-ups of the pattern rarely emphasize:

First, the outbox table will grow indefinitely if you do not archive or delete processed rows. A weekly job that moves events older than 30 days to a separate archive table keeps the working table fast.

Second, the relay worker's polling interval is a dial between latency and database load. 500ms is a reasonable default. For payment events, 100ms is worth the extra queries. For analytics events, 5 seconds is fine.

Third, the attempts column is the most useful operational metric on the table. A sudden spike in attempts for a specific event_type is almost always a consumer deployment that introduced a bug. Alerting on this saved hours of debugging during a RTYLR release that shipped a breaking change to a webhook consumer schema.

Not sure where to start?

If you are building event-driven systems in Go and want to implement reliable publishing without coordinating distributed transactions, Voxire can help you design the right architecture for your system's scale and reliability requirements.

https://voxire.com/get-a-quote/
