Circuit Breakers and Graceful Degradation for Go SaaS Backends

Every Go SaaS backend calls external services. In Lebanon and the MENA region, third-party API reliability varies significantly. One slow external dependency can cascade into a full outage. Circuit breakers are the standard solution, and this post covers the implementation in detail.

The Cascade Failure Problem in SaaS Backends

Regional dependencies fail in recognizable ways. Payment gateways go down during peak hours. SMS providers have intermittent failures. Currency exchange APIs become unreachable without warning.

One slow external dependency does not just affect the feature that depends on it. In a typical Go HTTP handler, a 10-second timeout on an SMS gateway call means every request that touches that code path holds its goroutine, and everything that goroutine references, for up to 10 seconds. Under load, goroutines accumulate. Memory climbs. Shared resources such as database connections, outbound HTTP connections, and file descriptors run out, and requests that never touch the SMS gateway start timing out too. The cascade has started.

This is not a hypothetical. It is the failure mode we have seen in RTYLR when integrating with regional payment processors and SMS gateways that have unpredictable availability windows.

Circuit breakers are the standard solution. They prevent cascading failures by short-circuiting calls to a failing dependency before they consume resources.

The Three Circuit Breaker States

A circuit breaker is a state machine with three states:

Closed (normal operation): Calls pass through. The breaker counts failures. When failures exceed a threshold within a time window, the breaker trips to Open.

Open (dependency is failing): Calls are rejected immediately without attempting the real call. The caller receives an error instantly rather than waiting for a timeout. After a cooldown period, the breaker transitions to Half-Open.

Half-Open (probing recovery): One request is allowed through. If it succeeds, the breaker transitions back to Closed. If it fails, the breaker returns to Open and resets the cooldown.

The key insight is that the Open state eliminates the latency cost of a failing dependency. Instead of waiting 10 seconds for a timeout, you get an error in microseconds. Goroutines return immediately instead of piling up behind the failing call.

A Minimal Go Implementation

The Go standard library does not include a circuit breaker. Libraries like sony/gobreaker are solid, but understanding the core state machine is valuable before reaching for a dependency.

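import (
    "errors"
    "sync"
    "time"
)

// ErrCircuitOpen is returned when a call is rejected without being attempted.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu          sync.Mutex
    state       State
    failures    int           // consecutive failures
    maxFailures int           // failures that trip the breaker
    cooldown    time.Duration // time spent Open before a probe is allowed
    lastFailure time.Time
    probing     bool // true while the single Half-Open probe is in flight
}

type CircuitBreakerConfig struct {
    MaxFailures int
    Cooldown    time.Duration
}

func NewCircuitBreaker(cfg CircuitBreakerConfig) *CircuitBreaker {
    return &CircuitBreaker{maxFailures: cfg.MaxFailures, cooldown: cfg.Cooldown}
}

// Call runs fn unless the breaker is Open. The lock is never held while fn
// runs, so a slow dependency does not serialize callers.
func (cb *CircuitBreaker) Call(fn func() error) error {
    probe, err := cb.beforeCall()
    if err != nil {
        return err
    }
    err = fn()
    cb.afterCall(probe, err)
    return err
}

// beforeCall reports whether the call may proceed and whether it is the probe.
func (cb *CircuitBreaker) beforeCall() (probe bool, err error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    if cb.state == StateOpen {
        if time.Since(cb.lastFailure) < cb.cooldown {
            return false, ErrCircuitOpen
        }
        cb.state = StateHalfOpen
    }
    if cb.state == StateHalfOpen {
        if cb.probing {
            return false, ErrCircuitOpen // one probe at a time
        }
        cb.probing = true
        return true, nil
    }
    return false, nil
}

// afterCall records the result and applies the state transition.
func (cb *CircuitBreaker) afterCall(probe bool, err error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    if probe {
        cb.probing = false
    }
    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        if probe || cb.failures >= cb.maxFailures {
            cb.state = StateOpen
        }
        return
    }
    if probe || cb.state == StateClosed {
        cb.failures = 0
        cb.state = StateClosed
    }
}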

This is a simplified version. Production implementations add:

Separate counters for success and failure in the Half-Open probe window
A sliding window for failure rate (not just a raw count)
Metrics emission on state transitions
Per-caller context propagation
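As an illustration of the sliding-window item above, here is a minimal sketch of a failure-rate window. The names SlidingWindow, Record, and ShouldTrip are hypothetical and not taken from any library. The sketch assumes a single goroutine and takes the current time as a parameter so the behavior is reproducible; a real breaker would guard it with a mutex.

```go
package main

import (
	"fmt"
	"time"
)

// outcome records one call result and when it was observed.
type outcome struct {
	at     time.Time
	failed bool
}

// SlidingWindow keeps call outcomes from the most recent window duration.
// Hypothetical helper for illustration; not safe for concurrent use.
type SlidingWindow struct {
	window   time.Duration
	outcomes []outcome
}

func NewSlidingWindow(window time.Duration) *SlidingWindow {
	return &SlidingWindow{window: window}
}

// Record stores a result observed at time now and drops expired entries.
func (w *SlidingWindow) Record(now time.Time, failed bool) {
	w.outcomes = append(w.outcomes, outcome{at: now, failed: failed})
	w.evict(now)
}

// evict removes outcomes older than the window.
func (w *SlidingWindow) evict(now time.Time) {
	cutoff := now.Add(-w.window)
	i := 0
	for i < len(w.outcomes) && w.outcomes[i].at.Before(cutoff) {
		i++
	}
	w.outcomes = w.outcomes[i:]
}

// FailureRate returns failures divided by total calls inside the window.
func (w *SlidingWindow) FailureRate(now time.Time) (rate float64, total int) {
	w.evict(now)
	total = len(w.outcomes)
	if total == 0 {
		return 0, 0
	}
	failures := 0
	for _, o := range w.outcomes {
		if o.failed {
			failures++
		}
	}
	return float64(failures) / float64(total), total
}

// ShouldTrip requires a minimum sample size so that a single failed call
// out of one does not open the breaker.
func (w *SlidingWindow) ShouldTrip(now time.Time, minCalls int, threshold float64) bool {
	rate, total := w.FailureRate(now)
	return total >= minCalls && rate >= threshold
}

// demoWindow records 10 calls, half of them failures, then checks the
// trip decision right away and again after the window has expired.
func demoWindow() (trippedNow, trippedLater bool) {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	w := NewSlidingWindow(30 * time.Second)
	for i := 0; i < 10; i++ {
		w.Record(start.Add(time.Duration(i)*time.Second), i%2 == 0)
	}
	trippedNow = w.ShouldTrip(start.Add(10*time.Second), 10, 0.5)
	trippedLater = w.ShouldTrip(start.Add(2*time.Minute), 10, 0.5)
	return trippedNow, trippedLater
}

func main() {
	now, later := demoWindow()
	fmt.Println(now, later)
}
```

The minimum-calls guard is a design choice worth keeping: a rate computed over two or three calls is too noisy to act on.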

For RTYLR, we use sony/gobreaker with custom settings per external service, wrapped in a thin adapter that maps to our internal error types.

Graceful Degradation Strategies

A circuit breaker tells you the dependency is unavailable. Graceful degradation tells you what to do instead. The right strategy depends on the feature.

Cached responses: For read-heavy APIs (menu data, pricing, product catalogs), serve the last cached response when the upstream is unavailable. The data may be slightly stale, but the user gets a response. Use Redis or an in-memory cache, with a TTL no longer than the staleness you are willing to show users.

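func (s *MenuService) GetMenu(ctx context.Context, tenantID string) (*Menu, error) {
    // Call only returns an error, so capture the menu through the closure.
    // fetchFromUpstream is assumed to return (*Menu, error).
    var menu *Menu
    err := s.cb.Call(func() error {
        var fetchErr error
        menu, fetchErr = s.fetchFromUpstream(ctx, tenantID)
        return fetchErr
    })
    if errors.Is(err, ErrCircuitOpen) {
        return s.cache.GetLast(ctx, tenantID) // may return stale
    }
    if err != nil {
        return nil, err
    }
    return menu, nil
}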

Feature flags: Disable non-critical features when their dependencies are unavailable. In RTYLR, SMS order notifications fail open: the order is still recorded, the kitchen still gets the ticket, and the customer notification is queued for retry. The SMS gateway being down does not block order processing.
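The fail-open behavior described for SMS notifications can be sketched roughly as follows. Every type here (Order, Notifier, RetryQueue, OrderService) is hypothetical and is not RTYLR's actual code, and the in-memory slice stands in for what would be a durable queue in practice.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCircuitOpen mirrors the sentinel returned by a breaker that is open.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// Order is a hypothetical order record.
type Order struct {
	ID    string
	Phone string
}

// Notifier sends the customer SMS; assumed to sit behind a circuit breaker.
type Notifier interface {
	Notify(o Order) error
}

// RetryQueue holds notifications to resend later (in-memory for illustration).
type RetryQueue struct {
	pending []Order
}

func (q *RetryQueue) Enqueue(o Order) { q.pending = append(q.pending, o) }

// OrderService records orders and sends notifications as a side effect.
type OrderService struct {
	saved    []Order
	notifier Notifier
	retries  *RetryQueue
}

// PlaceOrder records the order first. A notification failure is queued for
// retry and is never returned to the caller, so it cannot block the order.
func (s *OrderService) PlaceOrder(o Order) error {
	s.saved = append(s.saved, o) // critical path
	if err := s.notifier.Notify(o); err != nil {
		s.retries.Enqueue(o) // non-critical path: degrade quietly
	}
	return nil
}

// downNotifier simulates an SMS gateway whose breaker is open.
type downNotifier struct{}

func (downNotifier) Notify(Order) error { return ErrCircuitOpen }

// demoPlaceOrder places one order while the SMS gateway is unavailable.
func demoPlaceOrder() (saved, queued int, err error) {
	svc := &OrderService{notifier: downNotifier{}, retries: &RetryQueue{}}
	err = svc.PlaceOrder(Order{ID: "ord_1", Phone: "+9610000000"})
	return len(svc.saved), len(svc.retries.pending), err
}

func main() {
	saved, queued, err := demoPlaceOrder()
	fmt.Println(saved, queued, err)
}
```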

Reduced functionality mode: When a core dependency fails, acknowledge it. Return a structured response that tells the client which features are degraded. API consumers can adapt their UI accordingly.
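One possible shape for such a structured response is sketched below. The field names (order_id, degraded) and the status values are assumptions made for illustration, not an established schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FeatureStatus describes the health of one feature (hypothetical values).
type FeatureStatus string

const (
	StatusOK          FeatureStatus = "ok"
	StatusDegraded    FeatureStatus = "degraded"
	StatusUnavailable FeatureStatus = "unavailable"
)

// OrderResponse carries the normal payload plus an optional map of features
// that are not fully healthy. The map is omitted when everything is fine.
type OrderResponse struct {
	OrderID  string                   `json:"order_id"`
	Degraded map[string]FeatureStatus `json:"degraded,omitempty"`
}

// NewOrderResponse copies only the non-OK features into the response.
func NewOrderResponse(orderID string, features map[string]FeatureStatus) OrderResponse {
	resp := OrderResponse{OrderID: orderID}
	for name, status := range features {
		if status == StatusOK {
			continue
		}
		if resp.Degraded == nil {
			resp.Degraded = make(map[string]FeatureStatus)
		}
		resp.Degraded[name] = status
	}
	return resp
}

// encode marshals the response, returning an empty string on error.
func encode(resp OrderResponse) string {
	b, err := json.Marshal(resp)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	resp := NewOrderResponse("ord_1", map[string]FeatureStatus{
		"payments":          StatusOK,
		"sms_notifications": StatusDegraded,
	})
	fmt.Println(encode(resp))
	// prints {"order_id":"ord_1","degraded":{"sms_notifications":"degraded"}}
}
```

Omitting the degraded field on healthy responses keeps the common case unchanged for existing clients.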

Bulkheading Goroutine Pools

A circuit breaker prevents goroutine accumulation after the breaker opens. But before it opens, slow requests still accumulate. Bulkheading isolates slow dependencies so they cannot consume the entire goroutine budget.

The implementation in Go is a per-dependency semaphore:

type BulkheadedClient struct {
    sem  chan struct{}
    base ExternalClient
}

func NewBulkheadedClient(base ExternalClient, maxConcurrent int) *BulkheadedClient {
    return &BulkheadedClient{
        sem:  make(chan struct{}, maxConcurrent),
        base: base,
    }
}

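// ErrBulkheadFull signals that every slot for this dependency is in use.
var ErrBulkheadFull = errors.New("bulkhead full")

func (c *BulkheadedClient) Do(ctx context.Context, req Request) (Response, error) {
    // Do not take a slot for a request that is already cancelled.
    if err := ctx.Err(); err != nil {
        return Response{}, err
    }
    select {
    case c.sem <- struct{}{}:
        defer func() { <-c.sem }() // release the slot when the call returns
    default:
        // All slots busy: fail fast instead of queueing behind a slow dependency.
        return Response{}, ErrBulkheadFull
    }
    return c.base.Do(ctx, req)
}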

ErrBulkheadFull is a fast fail. The caller knows immediately that the dependency is saturated and can handle it (serve from cache, queue for retry, return a 503 to the client). No goroutine pile-up.

In RTYLR, the SMS gateway client has a bulkhead of 10 concurrent calls. The payment processor has a bulkhead of 20. These numbers are derived from the observed p99 response times and our goroutine budget.
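One way to arrive at limits like these, offered as an assumption rather than a description of the RTYLR process, is Little's law: the number of calls in flight is roughly the arrival rate multiplied by the latency. The sketch below applies it with the p99 latency and caps the result at a budget. The function name BulkheadSize and all input numbers are invented for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// BulkheadSize estimates how many concurrent slots a dependency needs:
// peak calls per second multiplied by p99 latency, rounded up, with a
// floor of 1 and a ceiling of budget (the share of concurrency you are
// willing to give this dependency).
func BulkheadSize(peakPerSecond float64, p99 time.Duration, budget int) int {
	n := int(math.Ceil(peakPerSecond * p99.Seconds()))
	if n < 1 {
		n = 1
	}
	if n > budget {
		n = budget
	}
	return n
}

func main() {
	// Invented inputs: 4 calls/s at a 1.5s p99 needs about 6 slots.
	fmt.Println(BulkheadSize(4, 1500*time.Millisecond, 40))
	// A very busy, slow dependency is capped by the budget.
	fmt.Println(BulkheadSize(50, 2*time.Second, 40))
}
```

Sizing from p99 rather than the mean leaves headroom for slow calls without letting the dependency claim an unbounded share.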

Production Example: MENA Payment and SMS Failures

MENA payment processors and SMS gateways have specific failure patterns worth understanding:

Regional DNS issues: Some providers use DNS TTLs that interact badly with Go's default resolver. Connections fail with i/o timeout rather than a clean error. These are hard to distinguish from slow responses.
Burst throttling: SMS providers in the region often enforce burst limits (100 SMS/minute) that are not well-documented. You hit them during peak hours (dinner rush for a restaurant platform like RTYLR), not during testing.
Payment gateway maintenance windows: Announced for 2am but sometimes start early or run long. A circuit breaker that opens at 5 consecutive failures catches this after five failed calls, which at even one request per second is about 5 seconds into the window.
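Because these failures tend to surface as generic timeouts, it can help to classify errors before they reach metrics or the breaker. The sketch below uses only the standard errors and net packages; the FailureKind type and the Classify function are hypothetical names, and the sample errors are constructed by hand rather than captured from a real provider.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// FailureKind is a coarse label for metrics (hypothetical naming).
type FailureKind string

const (
	KindNone    FailureKind = "none"
	KindDNS     FailureKind = "dns"
	KindTimeout FailureKind = "timeout"
	KindOther   FailureKind = "other"
)

// Classify inspects the error chain. DNS errors are checked first because a
// DNS timeout also reports Timeout() == true and would otherwise be lumped
// together with slow responses.
func Classify(err error) FailureKind {
	if err == nil {
		return KindNone
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return KindDNS
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return KindTimeout
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return KindTimeout
	}
	return KindOther
}

// demoKinds classifies three hand-built errors: a dial failure caused by a
// DNS timeout, a wrapped context deadline, and a plain error.
func demoKinds() []FailureKind {
	dnsTimeout := &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: &net.DNSError{Err: "i/o timeout", Name: "sms.example.invalid", IsTimeout: true},
	}
	deadline := fmt.Errorf("send sms: %w", context.DeadlineExceeded)
	refused := errors.New("connection refused")
	return []FailureKind{Classify(dnsTimeout), Classify(deadline), Classify(refused)}
}

func main() {
	fmt.Println(demoKinds())
}
```

With the kinds separated, a dashboard can show whether a breaker opened because of resolution problems or because the provider itself was slow.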

Configuration that works for regional providers:

gobreaker.Settings{
    Name:        "sms-gateway",
    MaxRequests: 1,              // probe with 1 request in half-open
    Interval:    30 * time.Second, // reset failure count every 30s
    Timeout:     60 * time.Second, // cooldown before half-open probe
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 5
    },
}

Testing Circuit Breakers

Circuit breaker logic is easy to unit test because the state machine is deterministic. Inject a failing function and assert state transitions:

func TestCircuitOpensAfterThreshold(t *testing.T) {
    cb := NewCircuitBreaker(CircuitBreakerConfig{
        MaxFailures: 3,
        Cooldown:    5 * time.Second,
    })

    alwaysFail := func() error { return errors.New("upstream error") }

    for i := 0; i < 3; i++ {
        _ = cb.Call(alwaysFail)
    }

    err := cb.Call(alwaysFail)
    require.ErrorIs(t, err, ErrCircuitOpen)
}

Integration tests should inject a flaky server (use net/http/httptest with a handler that fails a configurable percentage of the time) and verify that the circuit opens, holds, and recovers correctly over real time intervals.
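A flaky server along those lines might look like the sketch below. It assumes a deterministic failure pattern instead of randomness so that test runs are reproducible: failures are spread evenly so that, over any 100 consecutive requests, exactly failPercent of them return 503. The helper names newFlakyServer and countFailures are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newFlakyServer returns a test server that fails failPercent of requests
// with a 503. The pattern depends only on the request sequence number.
func newFlakyServer(failPercent int64) *httptest.Server {
	var seen int64
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt64(&seen, 1)
		// Fail whenever the running quota of failures increases.
		if n*failPercent/100 > (n-1)*failPercent/100 {
			http.Error(w, "injected failure", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
}

// countFailures sends requests sequentially and counts non-200 responses.
func countFailures(failPercent int64, requests int) (int, error) {
	srv := newFlakyServer(failPercent)
	defer srv.Close()

	failures := 0
	for i := 0; i < requests; i++ {
		resp, err := srv.Client().Get(srv.URL)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			failures++
		}
	}
	return failures, nil
}

func main() {
	n, err := countFailures(30, 10)
	fmt.Println(n, err)
}
```

In a real integration test, the client under test (breaker included) would point at srv.URL, and the assertions would be on the breaker's state rather than on raw status codes.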

Key Lessons from Production

Wrap every external call. Circuit breakers are only useful if they are applied consistently. If one code path bypasses the breaker and calls the SMS gateway directly, that path can still cascade.
Tune thresholds per service, not globally. A payment processor and a menu-fetch service have different acceptable failure rates. One global threshold fits neither well.
Emit metrics on state transitions. Every Open transition should fire a metric. Alert on it. A circuit breaker that opens silently is invisible until users complain.
Test your degradation paths. The happy path is tested constantly. The degraded path is tested rarely. Write integration tests that force the circuit open and verify the fallback behavior works.
Combine with retries carefully. Retries and circuit breakers interact. Retry with exponential backoff makes sense for transient errors. Retrying into an open circuit is pointless and wastes time. Check circuit state before deciding to retry.
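The last lesson can be sketched as a retry helper that backs off exponentially on ordinary errors but gives up immediately when the breaker rejects the call. This is a minimal illustration, not a full retry policy: the Retry function and its injected sleep parameter are hypothetical, and it detects the open circuit from the returned error rather than by querying breaker state.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrCircuitOpen mirrors the sentinel returned by a breaker that is open.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// Retry calls fn up to attempts times, doubling the delay after each
// failure. It stops at once on success or when the circuit is open, since
// retrying into an open circuit cannot succeed. sleep is injected so tests
// do not have to wait in real time.
func Retry(attempts int, base time.Duration, sleep func(time.Duration), fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || errors.Is(err, ErrCircuitOpen) {
			return err
		}
		if i < attempts-1 {
			sleep(delay)
			delay *= 2
		}
	}
	return err
}

// demoOpenCircuit returns how many times fn ran when the circuit is open.
func demoOpenCircuit() int {
	calls := 0
	_ = Retry(5, 10*time.Millisecond, func(time.Duration) {}, func() error {
		calls++
		return ErrCircuitOpen
	})
	return calls
}

// demoTransient fails twice, then succeeds, and reports the number of calls
// and the total backoff requested in milliseconds.
func demoTransient() (calls int, waitedMs int64, err error) {
	var waited time.Duration
	err = Retry(5, 10*time.Millisecond, func(d time.Duration) { waited += d }, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient upstream error")
		}
		return nil
	})
	return calls, int64(waited / time.Millisecond), err
}

func main() {
	fmt.Println(demoOpenCircuit())
	fmt.Println(demoTransient())
}
```

In production code, sleep would typically be a context-aware wait so that a cancelled request stops retrying as well.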

Not sure where to start?

If you are building a Go SaaS backend and want resilience patterns that hold up in production in the MENA region, Voxire can help you design and implement them. We have built and operated these systems across Lebanon and the Gulf. Get in touch at https://voxire.com/get-a-quote/
