Feature Flags in Production Go SaaS: Deploying Features Without Restarts

Most Go SaaS teams deploy new features in one of two ways: flip everything at once and hope nothing breaks, or maintain long-lived branches that eventually diverge enough to cause real pain at merge time. Feature flags solve both problems by decoupling code deployment from feature activation.

This is how RTYLR, Voxire's restaurant and retail platform used across Lebanon and the Gulf, handles incremental rollouts across tenant accounts without requiring service restarts or coordinated deployments.

Why environment variables are not enough

The first instinct when implementing feature flags is to use environment variables. Set ENABLE_NEW_CHECKOUT=true and redeploy. This works for the first flag. It breaks down around flag number five.

Environment variable flags require a service restart to change, which means any rollback requires a deployment. They cannot be scoped to individual tenants. They have no audit trail. They leak into your config files and create environment drift between production, staging, and local development.

The correct solution for any SaaS with more than two tenants and more than a handful of flags is a database-backed flag system. Flags live in PostgreSQL. Your Go services read them at request time. Changing a flag takes milliseconds and leaves a record.

The data model

The schema needs to answer two questions for every flag: is this flag on globally, and is it on for this specific tenant?

CREATE TABLE feature_flags (
  name        TEXT PRIMARY KEY,
  enabled     BOOLEAN NOT NULL DEFAULT FALSE,
  description TEXT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE feature_flag_overrides (
  flag_name   TEXT NOT NULL REFERENCES feature_flags(name) ON DELETE CASCADE,
  tenant_id   UUID NOT NULL,
  enabled     BOOLEAN NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  created_by  TEXT,
  PRIMARY KEY (flag_name, tenant_id)
);

The feature_flags table holds the default state for every flag. The feature_flag_overrides table holds tenant-specific overrides. Resolution is straightforward: if an override exists for a tenant, use it. Otherwise, use the global default.

For percentage rollouts, add a rollout_percentage column (an integer from 0 to 100) to feature_flags. For a given tenant, hash the flag name together with the tenant ID, reduce the hash to a bucket from 0 to 99, and treat the flag as enabled when the bucket is below the rollout percentage. Because the hash is deterministic, behavior is consistent: a tenant that is in a 20% rollout stays in it across every request.
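The bucketing step can be sketched in a few lines of Go. This is a minimal illustration rather than RTYLR's actual code: the function names rolloutBucket and inRollout are hypothetical, FNV-1a is an assumed choice of hash (any stable hash works), and the tenant ID is passed as a string so the example needs only the standard library.

```go
package main

import "hash/fnv"

// rolloutBucket maps a (flag, tenant) pair to a stable bucket in [0, 100).
// FNV-1a is an assumption made for this sketch; any stable hash would do.
func rolloutBucket(flagName, tenantID string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(flagName))
    h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
    h.Write([]byte(tenantID))
    return h.Sum32() % 100
}

// inRollout reports whether the tenant falls inside the rollout percentage.
// A percentage of 0 excludes every tenant and 100 includes every tenant.
func inRollout(flagName, tenantID string, percentage int) bool {
    if percentage <= 0 {
        return false
    }
    if percentage >= 100 {
        return true
    }
    return rolloutBucket(flagName, tenantID) < uint32(percentage)
}
```

Because the bucket depends only on the flag name and tenant ID, raising the percentage from 20 to 50 keeps every tenant that was already included and adds new ones. Including the flag name in the hash also means different flags select different subsets of tenants.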

The flag resolver in Go

The resolver lives in a single struct that gets instantiated at startup and injected into handlers.

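import (
    "context"
    "database/sql"
    "sync"
    "time"

    "github.com/google/uuid" // assumed: the uuid package is not named in this article
)

// FlagResolver resolves flags from PostgreSQL and caches results in memory.
type FlagResolver struct {
    db    *sql.DB
    cache sync.Map      // key "<flag name>:<tenant ID>" -> cachedFlag
    ttl   time.Duration // how long a cached result is served before re-querying
}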

type cachedFlag struct {
    enabled   bool
    expiresAt time.Time
}

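// IsEnabled returns the cached state for the flag and tenant while it is
// still fresh, and otherwise re-resolves it from the database.
func (r *FlagResolver) IsEnabled(ctx context.Context, name string, tenantID uuid.UUID) (bool, error) {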
    cacheKey := name + ":" + tenantID.String()
    if v, ok := r.cache.Load(cacheKey); ok {
        cf := v.(cachedFlag)
        if time.Now().Before(cf.expiresAt) {
            return cf.enabled, nil
        }
    }

    enabled, err := r.resolve(ctx, name, tenantID)
    if err != nil {
        return false, err
    }

    r.cache.Store(cacheKey, cachedFlag{
        enabled:   enabled,
        expiresAt: time.Now().Add(r.ttl),
    })
    return enabled, nil
}

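// resolve prefers a tenant override, then the global default, then false
// when the flag does not exist at all.
func (r *FlagResolver) resolve(ctx context.Context, name string, tenantID uuid.UUID) (bool, error) {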
    var enabled bool
    err := r.db.QueryRowContext(ctx, `
        SELECT COALESCE(
            (SELECT enabled FROM feature_flag_overrides
             WHERE flag_name = $1 AND tenant_id = $2),
            (SELECT enabled FROM feature_flags WHERE name = $1),
            false
        )
    `, name, tenantID).Scan(&enabled)
    return enabled, err
}

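A short TTL of 30 seconds means flag changes propagate within half a minute without hitting the database on every request. For immediate propagation when a flag is changed, add a simple cache invalidation endpoint that clears the sync.Map entries for the affected flag. Because cache keys combine the flag name and the tenant ID, that means deleting every key that begins with the flag name. Because the cache lives in process memory, the endpoint has to be called on every running instance.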
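The invalidation step could look roughly like the following. This is a sketch under stated assumptions: flagCache is a hypothetical stand-in for the cache field of FlagResolver, InvalidateFlag is an invented name, and flag names are assumed not to contain a colon, since the colon separates the flag name from the tenant ID in the cache key.

```go
package main

import (
    "strings"
    "sync"
)

// flagCache is a hypothetical stand-in for the cache field of FlagResolver.
type flagCache struct {
    entries sync.Map // key: "<flag name>:<tenant ID>"
}

// InvalidateFlag deletes the cached entries of one flag for all tenants and
// returns how many entries were removed. Assumes flag names contain no ':'.
func (c *flagCache) InvalidateFlag(name string) int {
    prefix := name + ":"
    removed := 0
    c.entries.Range(func(key, _ any) bool {
        if k, ok := key.(string); ok && strings.HasPrefix(k, prefix) {
            c.entries.Delete(key) // sync.Map permits calling its methods from inside Range
            removed++
        }
        return true // keep iterating over the remaining keys
    })
    return removed
}
```

An HTTP handler for the endpoint would call InvalidateFlag with the flag name taken from the request. Matching on the prefix including the colon keeps a flag such as split_billing from also clearing split_billing_v2.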

Using flags in handlers

Flags should be checked as early as possible in the request path, not buried inside business logic. A handler that checks a flag mid-execution and returns partial results is harder to reason about than one that fences the entire feature at the top.

func (h *OrderHandler) CreateOrder(w http.ResponseWriter, r *http.Request) {
    tenantID := middleware.TenantFromContext(r.Context())

    splitBilling, err := h.flags.IsEnabled(r.Context(), "split_billing", tenantID)
    if err != nil {
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

    if splitBilling {
        h.createOrderWithSplitBilling(w, r, tenantID)
        return
    }
    h.createOrderLegacy(w, r, tenantID)
}

Keep the two code paths alive and separately testable until the flag is fully rolled out. Delete the legacy code only after the flag's rows have been removed from the override table and you have confirmed that no tenant depends on the old path.

Gradual rollouts across RTYLR tenants

In the RTYLR context, a new feature might be ready but has only been tested against Beirut-area restaurant configurations. The risk profile is different for a restaurant in Riyadh with different menu structure, printer setup, and staff workflows.

The override table makes this simple. The engineering team enables the flag for a single trusted pilot tenant first. After one week of clean operation, a list of ten tenants gets the override. After a month, the global default flips to enabled, overrides are cleaned up, and the old code path is deleted.

This rollout pattern requires no external service: no LaunchDarkly subscription and no Unleash instance. It runs inside the same PostgreSQL database the rest of the system uses.

Flag lifecycle and preventing flag debt

The most common failure mode with feature flags lies not in creating them but in never cleaning them up. After six months of production operation, a team can easily have accumulated fifty flags. Twenty of them are fully rolled out but still in the codebase. Ten more are for features that were abandoned. The remaining twenty are genuinely active.

Two rules prevent this from becoming unmanageable:

First, name flags for the feature lifecycle, not the feature itself. new_checkout_flow is worse than checkout_v2. The version number signals that at some point, v2 becomes the default and the flag gets deleted. enable_split_billing_beta signals that at some point, beta ends.

Second, add a scheduled_removal_date column to feature_flags. At the start of every sprint, run a query against flags whose removal date has passed. Each one is a cleanup ticket. Teams that skip this step end up with flag debt that causes genuine bugs when old flags interact unexpectedly with new ones.

Key lessons from production

At RTYLR, the shift to database-backed flags changed how the team ships across Lebanon and Gulf deployments. Three things stood out:

First, rollbacks that previously required a redeployment now happen in under a minute by flipping a flag. Incidents that caused 20-minute outages now close in two minutes.

Second, feature flags forced the team to write two separate code paths, which ironically improved test coverage. Both paths get tested independently.

Third, the flag resolver's cache TTL is a tunable knob. During normal operation, 30 seconds is fine. During an active incident, the TTL is dropped to zero so operators can iterate on flag state without waiting.