Background jobs handle the work your HTTP handlers cannot: sending emails, processing uploads, syncing external APIs, generating reports, and running scheduled tasks. Building this correctly from the start prevents a class of production incidents that are difficult to debug and expensive to fix at scale.
This guide explains how we design and operate background job queue systems in Go with Redis for SaaS platforms in Lebanon and the Middle East region.
What makes background job systems fail in production?
The most common production failure modes for background job systems are:
Job loss on worker crash. A worker pulls a job from the queue and begins processing. Before the job completes, the worker crashes. Without proper acknowledgment mechanics, the job is lost silently. The user never receives their invoice email, and nobody notices until they call support.
Thundering herd after outage. A downstream service goes down for 30 minutes. Jobs accumulate in the queue. When the service recovers, thousands of jobs execute simultaneously, overwhelming the recovered service and triggering the outage again. Without rate limiting and concurrency controls, recovery becomes self-defeating.
Silent failure accumulation. Failed jobs are retried, fail again, and accumulate in a dead letter queue that nobody monitors. Over time, thousands of failed jobs represent data that was never processed, and the team discovers this during an audit, not in real time.
Priority inversion. Low-priority background exports block high-priority transactional emails because they share the same queue and the same worker pool. User-facing operations should never wait behind batch processing.
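To make the thundering-herd point concrete, here is a minimal sketch of a token bucket that caps how fast a backlog may drain once a downstream service recovers. The TokenBucket type, the simulateRecovery helper, and the numbers (burst of 20, 5 jobs per second) are illustrative assumptions, not part of any library, and the bucket is deliberately not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a hypothetical single-goroutine rate limiter for illustration.
type TokenBucket struct {
	capacity float64   // maximum burst size
	rate     float64   // tokens added per second
	tokens   float64   // tokens currently available
	last     time.Time // last time the bucket was refilled
}

// NewTokenBucket returns a bucket that starts full at time now.
func NewTokenBucket(capacity int, perSecond float64, now time.Time) *TokenBucket {
	return &TokenBucket{
		capacity: float64(capacity),
		rate:     perSecond,
		tokens:   float64(capacity),
		last:     now,
	}
}

// Allow reports whether one job may start at time now, consuming a token if so.
func (b *TokenBucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulateRecovery models a backlog that becomes runnable all at once and
// returns how many jobs start during the first `seconds` one-second ticks.
func simulateRecovery(backlog, burst int, perSecond float64, seconds int) int {
	start := time.Unix(0, 0)
	b := NewTokenBucket(burst, perSecond, start)
	started := 0
	for s := 0; s < seconds; s++ {
		now := start.Add(time.Duration(s) * time.Second)
		for started < backlog && b.Allow(now) {
			started++
		}
	}
	return started
}

func main() {
	// 1000 queued jobs, burst of 20, then 5 per second: 20 + 9*5 = 65.
	fmt.Println(simulateRecovery(1000, 20, 5, 10)) // prints 65
}
```

With these assumed numbers, the recovered service sees 65 calls in the first ten seconds instead of 1000 at once. In a real worker the same check would sit in front of the downstream call, with jobs that are denied a token rescheduled rather than dropped.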
Why Redis for job queues?
For early-stage to mid-stage SaaS products, Redis with a well-designed job library offers the right balance of operational simplicity and correctness guarantees.
Redis provides atomic list operations that make reliable job queuing possible without complex database transactions. The BRPOPLPUSH command (deprecated since Redis 6.2 in favor of the equivalent BLMOVE) atomically moves a job from the pending queue to a processing queue. That enables at-least-once delivery even when workers crash: if a worker holding a job in the processing queue goes down, a reaper process can move its jobs back to the pending queue after a timeout.
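The pending/processing/reaper pattern described above can be sketched without Redis at all. The following in-memory model is an illustration only: ReliableQueue, Claim, Ack, and Reap are hypothetical names, the 5-minute timeout is an assumption, and a real implementation would rely on Redis for the atomicity that this single-goroutine sketch takes for granted.

```go
package main

import (
	"fmt"
	"time"
)

// ReliableQueue is an in-memory stand-in for the two Redis lists.
// It is not safe for concurrent use.
type ReliableQueue struct {
	pending    []string
	processing map[string]time.Time // job -> time it was claimed
}

func NewReliableQueue(jobs ...string) *ReliableQueue {
	return &ReliableQueue{
		pending:    append([]string(nil), jobs...),
		processing: make(map[string]time.Time),
	}
}

// Claim moves the oldest pending job into the processing set in one step.
func (q *ReliableQueue) Claim(now time.Time) (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	job := q.pending[0]
	q.pending = q.pending[1:]
	q.processing[job] = now
	return job, true
}

// Ack removes a finished job from the processing set.
func (q *ReliableQueue) Ack(job string) { delete(q.processing, job) }

// Reap moves jobs claimed at least `timeout` ago back to pending and
// returns how many were recovered.
func (q *ReliableQueue) Reap(now time.Time, timeout time.Duration) int {
	recovered := 0
	for job, claimedAt := range q.processing {
		if now.Sub(claimedAt) >= timeout {
			delete(q.processing, job)
			q.pending = append(q.pending, job)
			recovered++
		}
	}
	return recovered
}

// crashScenario: one worker finishes its job, another crashes before Ack.
func crashScenario() (reapedEarly, reapedLate int, recovered string) {
	start := time.Unix(0, 0)
	timeout := 5 * time.Minute
	q := NewReliableQueue("invoice-1", "invoice-2")

	first, _ := q.Claim(start)
	q.Ack(first) // worker A completes invoice-1

	q.Claim(start) // worker B claims invoice-2, then crashes without Ack

	reapedEarly = q.Reap(start.Add(time.Minute), timeout)
	reapedLate = q.Reap(start.Add(10*time.Minute), timeout)
	recovered, _ = q.Claim(start.Add(10 * time.Minute))
	return
}

func main() {
	fmt.Println(crashScenario()) // prints 0 1 invoice-2
}
```

The job held by the crashed worker is never lost: it stays in the processing set until the reaper returns it. The price is that a slow but healthy worker can also be reaped, which is one reason handlers need to tolerate duplicate execution.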
For Go specifically, the asynq library is among the most production-ready Redis-based job systems available. It handles acknowledgment correctly, supports multiple queues with priorities, has a companion web UI (asynqmon) for monitoring, and is actively maintained.
Structuring job handlers in Go
A job handler in asynq is a function that accepts a context and a task, processes the task, and returns an error. Returning an error signals that the job should be retried.
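import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/hibiken/asynq"
)

// EmailPayload is the JSON body carried by an "email:send" task.
type EmailPayload struct {
	TenantID string `json:"tenant_id"`
	To       string `json:"to"`
	Template string `json:"template"`
}

// HandleSendEmail decodes the payload and sends the email.
// A non-nil error tells asynq to retry the task.
func HandleSendEmail(ctx context.Context, t *asynq.Task) error {
	var p EmailPayload
	if err := json.Unmarshal(t.Payload(), &p); err != nil {
		// A malformed payload will not fix itself, so skip further retries.
		return fmt.Errorf("unmarshal payload: %v: %w", err, asynq.SkipRetry)
	}
	// emailService is assumed to be a package-level client defined elsewhere.
	if err := emailService.Send(ctx, p.To, p.Template); err != nil {
		return fmt.Errorf("send email: %w", err)
	}
	return nil
}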
The handler is registered with the server:
mux := asynq.NewServeMux()
mux.HandleFunc("email:send", HandleSendEmail)
mux.HandleFunc("report:generate", HandleGenerateReport)
mux.HandleFunc("sync:external_api", HandleSyncExternalAPI)
Queue design: separate by priority and type
A flat queue design where all jobs compete equally is correct only if all jobs have equal importance and equal resource requirements. In practice, this is never true.
For a SaaS backend, a sensible queue structure:
- critical (concurrency: 5): password reset emails, payment confirmations, account activation
- default (concurrency: 10): transactional emails, notifications, short-lived sync operations
- bulk (concurrency: 3): report generation, data exports, batch operations
- scheduled (concurrency: 2): recurring tasks, analytics aggregation, cleanup jobs
srv := asynq.NewServer(
	asynq.RedisClientOpt{Addr: redisAddr},
	asynq.Config{
		// Relative priority weight per queue.
		Queues: map[string]int{
			"critical":  10,
			"default":   5,
			"bulk":      2,
			"scheduled": 1,
		},
		// Total number of concurrent workers shared by all queues.
		Concurrency: 20,
	},
)
The integer values are relative priorities, not absolute concurrency counts. With Concurrency: 20 and the priorities above, which sum to 18, work is divided proportionally while every queue has pending jobs: critical gets roughly 56% of worker capacity, default 28%, bulk 11%, and scheduled 6%.
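The proportional split is plain arithmetic: each weight divided by the sum of all weights. A small sketch, where queueShares is a hypothetical helper rather than an asynq function:

```go
package main

import (
	"fmt"
	"math"
)

// queueShares converts relative priority weights into rounded percentages.
func queueShares(weights map[string]int) map[string]int {
	total := 0
	for _, w := range weights {
		total += w
	}
	shares := make(map[string]int, len(weights))
	if total == 0 {
		return shares
	}
	for name, w := range weights {
		shares[name] = int(math.Round(100 * float64(w) / float64(total)))
	}
	return shares
}

func main() {
	shares := queueShares(map[string]int{
		"critical":  10,
		"default":   5,
		"bulk":      2,
		"scheduled": 1,
	})
	// Print in a fixed order; map iteration order is random in Go.
	for _, name := range []string{"critical", "default", "bulk", "scheduled"} {
		fmt.Printf("%s: %d%%\n", name, shares[name])
	}
}
```

Because each share is rounded independently, the percentages do not have to add up to exactly 100. The shares describe the split under load, when all queues have work waiting. A queue that is empty does not hold workers idle.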
Retry strategy and dead letter queues
Jobs that fail should retry with backoff, but not indefinitely. The retry schedule should account for transient errors (network blips, rate limits) without masking persistent failures (programming errors, corrupted data).
A reasonable default retry schedule:
- Attempt 1: immediately
- Attempt 2: 30 seconds
- Attempt 3: 5 minutes
- Attempt 4: 30 minutes
- Attempt 5: 2 hours
- After 5 attempts: move to dead letter queue, alert on-call
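// MaxRetry counts retries after the first attempt, so 4 retries = 5 attempts in total.
if _, err := client.Enqueue(
	asynq.NewTask("email:send", payload),
	asynq.MaxRetry(4),
	asynq.Retention(24*time.Hour), // keep the completed task visible for 24 hours
); err != nil {
	log.Printf("enqueue email:send: %v", err)
}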
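Setting a retry limit does not by itself produce the delays listed above. The schedule has to be expressed as a function from retry count to wait time. The sketch below uses only the standard library, and retrySchedule and retryDelay are hypothetical names. To our knowledge asynq accepts a comparable function through the RetryDelayFunc field of asynq.Config, with the signature func(n int, e error, t *asynq.Task) time.Duration. Treat that as an assumption to verify against the asynq version you run.

```go
package main

import (
	"fmt"
	"time"
)

// retrySchedule holds the wait before each retry.
// Index 0 is the delay before attempt 2, index 3 the delay before attempt 5.
var retrySchedule = []time.Duration{
	30 * time.Second,
	5 * time.Minute,
	30 * time.Minute,
	2 * time.Hour,
}

// retryDelay returns the wait before the next attempt, given how many
// retries have already happened. ok is false once the schedule is
// exhausted and the job belongs in the dead letter queue.
func retryDelay(retried int) (delay time.Duration, ok bool) {
	if retried < 0 || retried >= len(retrySchedule) {
		return 0, false
	}
	return retrySchedule[retried], true
}

func main() {
	for retried := 0; retried <= len(retrySchedule); retried++ {
		if d, ok := retryDelay(retried); ok {
			fmt.Printf("attempt %d after %s\n", retried+2, d)
		} else {
			fmt.Println("schedule exhausted: dead letter queue, alert on-call")
		}
	}
}
```

A fixed table like this is easy to read in an incident. Adding random jitter to each delay is a common refinement when many jobs fail at the same moment, so that their retries do not arrive together.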
The dead letter queue must be monitored. In production systems for MENA SaaS platforms, we set up an alert that fires whenever the dead letter queue exceeds 50 jobs, and we run a weekly review of dead letter patterns to identify systematic failures.
Idempotency: designing jobs to be safe to retry
At-least-once delivery means a job may execute more than once: due to retry on failure, due to network timeouts where the job completed but the acknowledgment was lost, or due to a worker crash during processing.
Job handlers must be idempotent: running the same job twice must produce the same result as running it once.
For email delivery, this means tracking sent emails in the database and skipping the send if a record already exists:
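// Assumes EmailPayload has gained a MessageID field (`json:"message_id"`)
// that the enqueuing code sets once per logical email.
func HandleSendEmail(ctx context.Context, t *asynq.Task) error {
	var p EmailPayload
	if err := json.Unmarshal(t.Payload(), &p); err != nil {
		return fmt.Errorf("unmarshal payload: %w", err)
	}
	// Check if this exact email has already been sent.
	sent, err := db.EmailAlreadySent(ctx, p.MessageID)
	if err != nil {
		return fmt.Errorf("check sent: %w", err)
	}
	if sent {
		return nil // idempotent: skip duplicate
	}
	if err := emailService.Send(ctx, p.To, p.Template); err != nil {
		return fmt.Errorf("send email: %w", err)
	}
	return db.MarkEmailSent(ctx, p.MessageID)
}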
For data processing jobs, idempotency often comes naturally: processing the same data twice produces the same database state if you use upsert operations rather than unconditional inserts.
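The difference between upsert and unconditional insert is easy to see in a toy model. In this sketch the Store type is an in-memory stand-in for a database table, and Invoice, Upsert, Insert, and processTwice are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Invoice is an example record produced by a data processing job.
type Invoice struct {
	ID          string
	AmountCents int
}

// Store models two ways of writing the same record.
type Store struct {
	byID     map[string]Invoice // written with upsert semantics
	appended []Invoice          // written with unconditional inserts
}

// Upsert overwrites any existing row with the same ID.
func (s *Store) Upsert(inv Invoice) { s.byID[inv.ID] = inv }

// Insert always adds a new row, even if the ID already exists.
func (s *Store) Insert(inv Invoice) { s.appended = append(s.appended, inv) }

// processTwice delivers the same job twice and reports the resulting
// row counts for each write strategy.
func processTwice() (upserted, inserted int) {
	s := &Store{byID: make(map[string]Invoice)}
	inv := Invoice{ID: "inv-42", AmountCents: 12900}
	for delivery := 0; delivery < 2; delivery++ {
		s.Upsert(inv)
		s.Insert(inv)
	}
	return len(s.byID), len(s.appended)
}

func main() {
	fmt.Println(processTwice()) // prints 1 2
}
```

The upsert path ends with one row no matter how many times the job runs, while the insert path duplicates the record on every redelivery. In SQL the same effect usually comes from a unique key on the natural identifier combined with the database's upsert statement.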
Scheduled jobs: replacing cron
Cron jobs on a single server are a reliability bottleneck: if the server goes down, scheduled jobs do not run. For SaaS products, scheduled work should run in the same distributed worker pool as on-demand jobs.
asynq supports periodic tasks with a cron expression scheduler:
scheduler := asynq.NewScheduler(
	asynq.RedisClientOpt{Addr: redisAddr},
	&asynq.SchedulerOpts{},
)
scheduler.Register("0 */6 * * *", asynq.NewTask("sync:external_api", nil))
scheduler.Register("0 2 * * *", asynq.NewTask("report:daily_aggregate", nil))
scheduler.Register("*/15 * * * *", asynq.NewTask("cleanup:expired_sessions", nil))
The scheduler enqueues jobs at the specified time. The worker pool picks them up and processes them like any other job. If the scheduler instance goes down, starting it again recovers correctly because it reads its schedule from configuration, not from stateful storage.
Observability: the asynq web UI and metrics
asynq has a companion project, asynqmon, a web UI that shows queue depths, processing rates, error rates, and the dead letter queue contents. For a SaaS backend, running asynqmon behind an internal VPN or an authenticated reverse proxy gives the team real-time visibility into job processing health.
For production alerting, export the following metrics to Prometheus:
- Queue depth per queue (critical: alert if above 100 pending for more than 2 minutes)
- Processing rate per queue (alert if processing rate drops to zero during business hours)
- Dead letter queue size (alert if above 50)
- Job processing latency p99 per job type
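The "above 100 pending for more than 2 minutes" rule is a sustained-threshold check: a brief spike should not page anyone, and the timer should reset whenever depth drops back under the limit. In a Prometheus setup this normally lives in an alerting rule with a for clause rather than in application code. The sketch below only illustrates the logic, and DepthAlert and firstFiring are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// DepthAlert fires once depth has stayed above Threshold for longer than For.
type DepthAlert struct {
	Threshold int
	For       time.Duration

	breaching bool
	since     time.Time // start of the current breach
}

// Observe records one sample and reports whether the alert is firing.
func (a *DepthAlert) Observe(now time.Time, depth int) bool {
	if depth <= a.Threshold {
		a.breaching = false // any recovery resets the timer
		return false
	}
	if !a.breaching {
		a.breaching = true
		a.since = now
	}
	return now.Sub(a.since) > a.For
}

// firstFiring feeds evenly spaced samples through the critical-queue rule
// (above 100 for more than 2 minutes) and returns the index of the first
// sample at which the alert fires, or -1 if it never does.
func firstFiring(samples []int, intervalSeconds int) int {
	a := &DepthAlert{Threshold: 100, For: 2 * time.Minute}
	start := time.Unix(0, 0)
	for i, depth := range samples {
		now := start.Add(time.Duration(i*intervalSeconds) * time.Second)
		if a.Observe(now, depth) {
			return i
		}
	}
	return -1
}

func main() {
	// Samples every 30 seconds. Sustained breach fires at the 150-second sample.
	fmt.Println(firstFiring([]int{150, 150, 150, 150, 150, 150}, 30)) // prints 5
	// A dip below the threshold resets the timer, so this never fires.
	fmt.Println(firstFiring([]int{150, 150, 50, 150, 150, 150, 150}, 30)) // prints -1
}
```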
In MENA SaaS platforms specifically, monitoring during Ramadan is important: traffic patterns shift significantly, and the evening rush that follows iftar can spike queue depths quickly.
Key lessons from production
Separate queues for different job priorities from day one. Retrofitting queue separation after a bulk job type has started starving transactional operations is operationally disruptive.
All job handlers must be idempotent. Document the idempotency strategy for each job type, because the next engineer who modifies the handler needs to preserve it.
Monitor the dead letter queue actively. Silent accumulation of failed jobs represents data loss that is often discovered months after the fact.
The asynq web UI should be accessible to the engineering team in production without requiring a database query. Real-time visibility into queue health prevents many support incidents from escalating.
Voxire builds background processing systems for SaaS platforms in Lebanon and across the MENA region. If you are designing a job queue system or debugging reliability issues in an existing background processing implementation, reach out.
https://voxire.com/get-a-quote/



