Building a Reliable Webhook Delivery System in Go for SaaS Platforms

Webhooks look simple until they fail silently at 3am and a paying customer cannot reconcile their accounting data. The interface is just an HTTP POST to a URL the customer registered. The hard part is everything around that POST: what happens when their server is down, when you send the same event twice, when their endpoint returns a 200 but discards the body, or when your outbox grows faster than your delivery workers can clear it.

This is how we design and operate webhook delivery systems in Go for SaaS platforms serving Lebanon and the MENA region.

Why is webhook delivery harder than a simple HTTP POST?

The fundamental problem is that your system generates events synchronously, but webhook delivery is inherently asynchronous and unreliable. The customer's server might be down for maintenance, overwhelmed by traffic, behind a firewall that blocks your IP range, or simply slow. Your transaction cannot wait for their HTTP response.

Naive webhook implementations call the customer's endpoint inline inside a database transaction. This creates a hard coupling between your system's correctness and an external party's availability. A customer endpoint that takes 30 seconds to respond will hold your database connection for 30 seconds. An endpoint that never responds will cause your transaction to time out.

The correct abstraction separates event generation from delivery: write the event to an outbox table inside the same transaction that creates the business event, then deliver asynchronously from a background worker pool.

The outbox pattern: writing events atomically

The outbox pattern stores webhook events in the same database transaction as the business event that triggers them. This guarantees you never generate a business event without a corresponding webhook record, and never create a webhook record without a corresponding business event.

In PostgreSQL:

CREATE TABLE webhook_events (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id   UUID NOT NULL,
    event_type  TEXT NOT NULL,
    payload     JSONB NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status      TEXT NOT NULL DEFAULT 'pending',
    attempts    INT NOT NULL DEFAULT 0,
    last_error  TEXT,
    next_retry  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    delivered_at TIMESTAMPTZ
);

CREATE INDEX idx_webhook_events_pending
    ON webhook_events (next_retry)
    WHERE status = 'pending';

Inside the Go handler that processes a business operation:

func (s *Service) ProcessOrder(ctx context.Context, order Order) error {
    // Both inserts share one transaction: the order and its outbox row
    // are committed together, or neither is.
    return s.db.BeginTxFunc(ctx, func(tx *sqlx.Tx) error {
        if err := insertOrder(ctx, tx, order); err != nil {
            return err
        }
        payload, err := json.Marshal(OrderCreatedEvent{OrderID: order.ID, ...})
        if err != nil {
            return fmt.Errorf("marshal order.created payload: %w", err)
        }
        _, err = tx.ExecContext(ctx, `
            INSERT INTO webhook_events (tenant_id, event_type, payload)
            VALUES ($1, 'order.created', $2)
        `, order.TenantID, payload)
        return err
    })
}

The webhook event is now guaranteed to exist if and only if the order was created.

Delivery workers: retry logic and exponential backoff

The delivery worker polls the outbox for pending events and attempts delivery. The key design decision is how to handle failures.

Exponential backoff with jitter is the standard approach. An immediate retry of a failed endpoint usually fails again for the same reason. Spacing retries with increasing delays gives the customer's system time to recover while reducing load on a struggling endpoint.

A simple backoff schedule, shown without jitter for readability:

Attempt 1: immediate
Attempt 2: 30 seconds
Attempt 3: 5 minutes
Attempt 4: 30 minutes
Attempt 5: 2 hours
Attempt 6+: 24 hours, then mark failed

In Go:

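// nextRetryDelay returns how long to wait before the next delivery attempt.
// attempt is the number of attempts already made: 0 selects the first
// attempt (immediate), 1 the first retry (30 seconds), and so on.
func nextRetryDelay(attempt int) time.Duration {
    delays := []time.Duration{
        0,
        30 * time.Second,
        5 * time.Minute,
        30 * time.Minute,
        2 * time.Hour,
        24 * time.Hour,
    }
    if attempt < len(delays) {
        return delays[attempt]
    }
    // Past the end of the schedule, keep waiting a full day between attempts.
    return 24 * time.Hour
}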
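
The schedule above has no jitter, so events that failed together will also retry together. The following is a minimal sketch of one way to add it, assuming a spread of plus or minus 20% around each base delay. The jitteredDelay name, the spread, and the clamping of out-of-range attempts are illustrative assumptions, not part of the original design.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// baseDelays repeats the article's schedule; the index is the number of
// attempts already made.
var baseDelays = []time.Duration{
	0,
	30 * time.Second,
	5 * time.Minute,
	30 * time.Minute,
	2 * time.Hour,
	24 * time.Hour,
}

// jitteredDelay scales the base delay by a factor in [0.8, 1.2).
// r must be in [0, 1); callers would normally pass rand.Float64().
// The 20% spread is an assumption chosen for illustration.
func jitteredDelay(attempt int, r float64) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= len(baseDelays) {
		attempt = len(baseDelays) - 1
	}
	factor := 0.8 + 0.4*r
	return time.Duration(float64(baseDelays[attempt]) * factor)
}

func main() {
	for attempt := 0; attempt < 7; attempt++ {
		d := jitteredDelay(attempt, rand.Float64())
		fmt.Printf("after %d attempts: wait %v\n", attempt, d)
	}
}
```

Passing the random value in as a parameter keeps the function deterministic, which makes it easy to unit test.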

The worker loop claims events atomically to prevent multiple workers from delivering the same event:

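func (w *Worker) claimEvent(ctx context.Context) (*WebhookEvent, error) {
    var event WebhookEvent
    // The inner SELECT locks one due row and skips rows that other workers
    // have locked; the outer UPDATE marks it so later polls ignore it.
    err := w.db.QueryRowContext(ctx, `
        UPDATE webhook_events
        SET status = 'processing'
        WHERE id = (
            SELECT id FROM webhook_events
            WHERE status = 'pending'
              AND next_retry <= NOW()
            ORDER BY next_retry
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        )
        RETURNING *
    `).Scan(&event.ID, &event.TenantID, ...)
    if err != nil {
        // sql.ErrNoRows means no event is due yet.
        return nil, err
    }
    return &event, nil
}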

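FOR UPDATE SKIP LOCKED is the key: multiple concurrent workers will each claim a different event without blocking each other. Because the claim is committed as soon as the statement finishes, a worker that crashes mid-delivery leaves its event in 'processing', so a periodic sweep should return stale 'processing' rows to 'pending'.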
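
Once a claimed event has been attempted, the worker has to move it out of 'processing'. The sketch below shows one way to make that decision as a pure function, assuming the attempt counter is incremented before the call and that six attempts is the cutoff suggested by the schedule. The planNext name, the maxAttempts value, and the 'delivered' and 'failed' status strings are assumptions for illustration; only 'pending' and 'processing' appear in the original code.

```go
package main

import (
	"fmt"
	"time"
)

// maxAttempts is an assumption: the schedule lists six attempts before
// the event is marked failed.
const maxAttempts = 6

// retryDelays repeats the article's schedule; the index is the number of
// attempts already made.
var retryDelays = []time.Duration{
	0,
	30 * time.Second,
	5 * time.Minute,
	30 * time.Minute,
	2 * time.Hour,
	24 * time.Hour,
}

// planNext decides what to write back after one delivery attempt.
// attempts counts every attempt so far, including the one that just finished.
// It returns the new status and, for retries, how long to wait.
func planNext(attempts int, delivered bool) (status string, delay time.Duration) {
	if attempts < 0 {
		attempts = 0
	}
	switch {
	case delivered:
		return "delivered", 0
	case attempts >= maxAttempts:
		return "failed", 0
	default:
		return "pending", retryDelays[attempts]
	}
}

func main() {
	for attempts := 1; attempts <= maxAttempts; attempts++ {
		status, delay := planNext(attempts, false)
		fmt.Printf("failure %d -> status=%s, retry in %v\n", attempts, status, delay)
	}
}
```

The worker would then persist the result in a single UPDATE that sets status, attempts, last_error, and next_retry (now plus the returned delay), or delivered_at on success.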

Idempotency: handling duplicate deliveries

HTTP cannot tell you whether a request was processed when the response never arrives. A webhook POST can succeed on the customer's end while the response is lost in transit, causing your system to retry an event that was already delivered. Retries therefore give you at-least-once delivery, not exactly-once, and customers need a way to detect and discard duplicates.

The standard approach is to attach a stable event ID to every delivery, together with an ID for the individual attempt, in the request headers:

X-Voxire-Event-ID: 7f3a4b2c-...
X-Voxire-Delivery-ID: d8e1f9a0-...

The Event ID is stable across retries. The Delivery ID is unique per delivery attempt. Customers who want to deduplicate on their end check the Event ID and discard any event they have already processed.
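
On the sending side, the two headers can be set when the request is built. This is a sketch under a few assumptions: buildRequest and newDeliveryID are hypothetical helpers, the Delivery ID is a random 128-bit hex string rather than a UUID, and the endpoint URL is a placeholder. Production code would also attach a context with a timeout to the request.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newDeliveryID returns a random 128-bit identifier encoded as hex.
func newDeliveryID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// buildRequest prepares one delivery attempt. eventID comes from the outbox
// row and never changes; the Delivery ID is generated fresh on every call.
func buildRequest(endpointURL, eventID string, body []byte) (*http.Request, error) {
	deliveryID, err := newDeliveryID()
	if err != nil {
		return nil, fmt.Errorf("generate delivery id: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, endpointURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Voxire-Event-ID", eventID)
	req.Header.Set("X-Voxire-Delivery-ID", deliveryID)
	return req, nil
}

func main() {
	body := []byte(`{"order_id":"123"}`)
	// Two attempts for the same event: same Event ID, different Delivery IDs.
	for i := 0; i < 2; i++ {
		req, err := buildRequest("https://example.com/webhooks", "7f3a4b2c", body)
		if err != nil {
			panic(err)
		}
		fmt.Println(req.Header.Get("X-Voxire-Event-ID"), req.Header.Get("X-Voxire-Delivery-ID"))
	}
}
```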

This is the same pattern Stripe uses: their id field on every event is stable, and their documentation explicitly tells customers to use it for idempotency.
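
On the receiving side, deduplication comes down to remembering which Event IDs have already been handled. The sketch below keeps them in an in-memory set for brevity; processedEvents and markIfNew are hypothetical names, and a real consumer would more likely rely on a unique constraint in its own database so the record survives restarts.

```go
package main

import (
	"fmt"
	"sync"
)

// processedEvents remembers which event IDs have been handled.
type processedEvents struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newProcessedEvents() *processedEvents {
	return &processedEvents{seen: make(map[string]struct{})}
}

// markIfNew records eventID and reports true only the first time it is seen.
func (p *processedEvents) markIfNew(eventID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, dup := p.seen[eventID]; dup {
		return false
	}
	p.seen[eventID] = struct{}{}
	return true
}

func main() {
	events := newProcessedEvents()
	// The third delivery repeats the first event ID, as a retry would.
	for _, id := range []string{"evt-1", "evt-2", "evt-1"} {
		if events.markIfNew(id) {
			fmt.Println("processing", id)
		} else {
			fmt.Println("duplicate, skipping", id)
		}
	}
}
```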

Signature verification: proving the delivery is from you

Customers should verify that webhook deliveries are genuinely from your system, not from an attacker who knows their endpoint URL.

The standard approach is HMAC-SHA256 signing. For each delivery, compute a signature over the request body using a secret key that only your system and the customer share:

func sign(payload []byte, secret string) string {
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(payload)
    return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

Include the signature in the request headers:

X-Voxire-Signature: sha256=a4b8c2d1...

Customers verify on their end by computing the same HMAC over the raw request body with their stored secret and comparing it to the header value. The comparison should be constant-time (hmac.Equal in Go) to prevent timing attacks.
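
A customer-side check might look like the following sketch. The sign function repeats the one shown earlier; verify is a hypothetical helper name, and the secret and payload are placeholder values.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes the signature header value, as in the sender's code.
func sign(payload []byte, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature over the raw body and compares it to the
// received header value. hmac.Equal compares without leaking timing
// information about where the two values differ.
func verify(payload []byte, secret, header string) bool {
	expected := sign(payload, secret)
	return hmac.Equal([]byte(expected), []byte(header))
}

func main() {
	body := []byte(`{"order_id":"123"}`)
	header := sign(body, "tenant-secret")

	fmt.Println(verify(body, "tenant-secret", header))                  // genuine delivery
	fmt.Println(verify([]byte(`{"order_id":"999"}`), "tenant-secret", header)) // tampered body
}
```

The signature must be checked against the exact bytes received, before any JSON parsing or re-serialization, because re-encoding can change whitespace or key order.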

Each tenant gets their own webhook signing secret, generated when they register their endpoint. Secrets should be rotatable: allow tenants to rotate their secret and give them a grace period where both the old and new signature are sent, so they can update their verification logic without downtime.
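
One way to send both signatures during the grace period is to put them in the same header. The sketch below assumes a comma-separated header value, which is an illustrative format and not something the original specifies; signatureHeader and verifyAny are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

func sign(payload []byte, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// signatureHeader signs the payload with every active secret and joins the
// results with commas. During rotation the sender passes old and new secrets.
func signatureHeader(payload []byte, secrets ...string) string {
	sigs := make([]string, 0, len(secrets))
	for _, s := range secrets {
		sigs = append(sigs, sign(payload, s))
	}
	return strings.Join(sigs, ",")
}

// verifyAny accepts the delivery if any signature in the header matches the
// receiver's secret, so it works before and after the receiver switches.
func verifyAny(payload []byte, secret, header string) bool {
	expected := []byte(sign(payload, secret))
	ok := false
	for _, candidate := range strings.Split(header, ",") {
		if hmac.Equal(expected, []byte(strings.TrimSpace(candidate))) {
			ok = true
		}
	}
	return ok
}

func main() {
	body := []byte(`{"order_id":"123"}`)
	header := signatureHeader(body, "old-secret", "new-secret")

	fmt.Println(verifyAny(body, "old-secret", header)) // receiver not yet updated
	fmt.Println(verifyAny(body, "new-secret", header)) // receiver already updated
}
```

When the grace period ends, the sender drops the old secret from the call and receivers that have switched keep verifying without any change.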

Endpoint management and health tracking

Not all customer endpoints are equal. Some are flaky and fail 20% of the time. Some are permanently offline. Delivering aggressively to a dead endpoint wastes compute and queue capacity.

Maintain a health state per endpoint. After consecutive failures, circuit-break the endpoint: stop attempting delivery until a cooldown period passes, then send a single probe. If the probe succeeds, resume normal delivery. If it fails, extend the cooldown.

A simple state machine:

ACTIVE: deliver normally
DEGRADED: slow retries, monitor carefully
CIRCUIT_OPEN: no deliveries, wait for cooldown
DISABLED: tenant explicitly disabled the endpoint
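
The state machine above can be sketched as a small Go type. The thresholds (3 failures to degrade, 10 to open the circuit) and the cooldowns (5 minutes, doubling up to 6 hours) are assumptions chosen for illustration, and the Endpoint type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

type State string

const (
	Active      State = "ACTIVE"
	Degraded    State = "DEGRADED"
	CircuitOpen State = "CIRCUIT_OPEN"
	Disabled    State = "DISABLED"
)

// Thresholds and cooldowns are illustrative assumptions.
const (
	degradedAfter = 3
	openAfter     = 10
	baseCooldown  = 5 * time.Minute
	maxCooldown   = 6 * time.Hour
)

// Endpoint tracks delivery health for one customer endpoint. It is not safe
// for concurrent use; guard it with a mutex or keep the state in the database.
type Endpoint struct {
	State    State
	Failures int           // consecutive failures
	Cooldown time.Duration // current cooldown while the circuit is open
	RetryAt  time.Time     // earliest time the next probe may be sent
}

// Allow reports whether a delivery may be attempted at time now.
func (e *Endpoint) Allow(now time.Time) bool {
	switch e.State {
	case Disabled:
		return false
	case CircuitOpen:
		if now.Before(e.RetryAt) {
			return false
		}
		// Let a single probe through and hold further traffic until its
		// result is recorded.
		e.RetryAt = now.Add(e.Cooldown)
		return true
	default:
		return true
	}
}

// RecordSuccess resumes normal delivery.
func (e *Endpoint) RecordSuccess() {
	if e.State == Disabled {
		return
	}
	e.State, e.Failures, e.Cooldown = Active, 0, 0
}

// RecordFailure counts a failed attempt and moves the state machine forward.
func (e *Endpoint) RecordFailure(now time.Time) {
	if e.State == Disabled {
		return
	}
	e.Failures++
	switch {
	case e.State == CircuitOpen:
		// A failed probe extends the cooldown.
		e.Cooldown *= 2
		if e.Cooldown > maxCooldown {
			e.Cooldown = maxCooldown
		}
		e.RetryAt = now.Add(e.Cooldown)
	case e.Failures >= openAfter:
		e.State = CircuitOpen
		e.Cooldown = baseCooldown
		e.RetryAt = now.Add(e.Cooldown)
	case e.Failures >= degradedAfter:
		e.State = Degraded
	}
}

func main() {
	e := &Endpoint{State: Active}
	now := time.Now()
	for i := 0; i < openAfter; i++ {
		e.RecordFailure(now)
	}
	fmt.Println(e.State, e.Allow(now))          // circuit open, delivery blocked
	fmt.Println(e.Allow(now.Add(baseCooldown))) // cooldown elapsed, one probe allowed
	e.RecordSuccess()
	fmt.Println(e.State) // probe succeeded, back to normal delivery
}
```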

For SaaS platforms in the MENA region, circuit-breaking is particularly important during regional internet maintenance windows and Ramadan nights when traffic patterns shift significantly.

Observability: knowing what is happening

Webhook delivery is a background operation. Without observability, you will not know about failures until customers complain.

Minimum metrics to track per tenant:

Delivery success rate over the last 24 hours
Current pending queue depth
Average delivery latency
Consecutive failure count
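
Three of these four figures can be derived from a log of delivery attempts; queue depth comes from counting pending rows in the outbox instead. The sketch below folds a slice of attempts into per-tenant statistics. The Attempt and Stats types are hypothetical, and in production the same numbers would more likely come from a metrics library or an SQL aggregate over the last 24 hours.

```go
package main

import (
	"fmt"
	"time"
)

// Attempt is a hypothetical record of one delivery attempt.
type Attempt struct {
	TenantID string
	OK       bool
	Latency  time.Duration
}

// Stats holds the per-tenant figures derived from attempts.
type Stats struct {
	Attempts            int
	Successes           int
	TotalLatency        time.Duration
	ConsecutiveFailures int
}

// SuccessRate returns the fraction of attempts that succeeded.
func (s Stats) SuccessRate() float64 {
	if s.Attempts == 0 {
		return 0
	}
	return float64(s.Successes) / float64(s.Attempts)
}

// AvgLatency returns the mean latency across all attempts.
func (s Stats) AvgLatency() time.Duration {
	if s.Attempts == 0 {
		return 0
	}
	return s.TotalLatency / time.Duration(s.Attempts)
}

// summarize folds attempts, ordered oldest first, into per-tenant stats.
func summarize(attempts []Attempt) map[string]Stats {
	out := make(map[string]Stats)
	for _, a := range attempts {
		s := out[a.TenantID]
		s.Attempts++
		s.TotalLatency += a.Latency
		if a.OK {
			s.Successes++
			s.ConsecutiveFailures = 0
		} else {
			s.ConsecutiveFailures++
		}
		out[a.TenantID] = s
	}
	return out
}

func main() {
	stats := summarize([]Attempt{
		{TenantID: "t1", OK: true, Latency: 100 * time.Millisecond},
		{TenantID: "t1", OK: false, Latency: 300 * time.Millisecond},
		{TenantID: "t2", OK: true, Latency: 50 * time.Millisecond},
	})
	for tenant, s := range stats {
		fmt.Printf("%s: success=%.0f%% avg=%v consecutive_failures=%d\n",
			tenant, s.SuccessRate()*100, s.AvgLatency(), s.ConsecutiveFailures)
	}
}
```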

Structured logs on every delivery attempt with outcome, latency, HTTP status code, and tenant ID make it possible to investigate specific customer complaints quickly.
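
With Go 1.21 or later, the standard library's log/slog package is enough for this. The following is a sketch, assuming a hypothetical Attempt type and outcome labels such as "delivered" and "timeout"; the field names are illustrative.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"time"
)

// Attempt is a hypothetical record of one delivery attempt.
type Attempt struct {
	TenantID   string
	EventID    string
	Outcome    string // for example "delivered", "http_error", "timeout"
	StatusCode int    // 0 when no HTTP response was received
	Latency    time.Duration
}

// attemptAttrs turns an attempt into structured log attributes.
func attemptAttrs(a Attempt) []slog.Attr {
	return []slog.Attr{
		slog.String("tenant_id", a.TenantID),
		slog.String("event_id", a.EventID),
		slog.String("outcome", a.Outcome),
		slog.Int("http_status", a.StatusCode),
		slog.Duration("latency", a.Latency),
	}
}

// logAttempt writes one log line per delivery attempt. Anything other than
// a successful delivery is logged at warning level.
func logAttempt(ctx context.Context, logger *slog.Logger, a Attempt) {
	level := slog.LevelInfo
	if a.Outcome != "delivered" {
		level = slog.LevelWarn
	}
	logger.LogAttrs(ctx, level, "webhook delivery attempt", attemptAttrs(a)...)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logAttempt(context.Background(), logger, Attempt{
		TenantID:   "tenant-42",
		EventID:    "7f3a4b2c",
		Outcome:    "http_error",
		StatusCode: 503,
		Latency:    840 * time.Millisecond,
	})
}
```

Because every line carries tenant_id and event_id, support can filter the logs for one customer or trace every attempt made for a single event.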

A webhook delivery dashboard showing per-tenant delivery health is table stakes for any SaaS product that customers integrate deeply into their operations. Lebanese and Gulf enterprise customers in particular expect a delivery log they can inspect when something looks wrong in their systems.

Key lessons from production

The outbox pattern eliminates the most dangerous failure mode: losing events because a network call failed inside a transaction. Always write events to your database before attempting delivery.

FOR UPDATE SKIP LOCKED is the correct PostgreSQL primitive for distributing work across multiple workers without coordination overhead. Use it from day one.

Sign every delivery with HMAC-SHA256. Customers who integrate deeply will ask for it, and retrofitting signature verification after the fact is painful.

Circuit-breaking per endpoint prevents a handful of consistently failing endpoints from consuming disproportionate queue capacity.

Build a per-tenant delivery log into your product from the beginning. The first time a customer calls support because their system is out of sync, you will be grateful you can show them exactly what was delivered and when.
