Go's concurrency primitives make async job processing look deceptively simple. The default patterns work in tutorials but leak goroutines, drop jobs, and saturate your database in production SaaS workloads.
Go's goroutines and channels are genuinely powerful tools for concurrent system design. They are also responsible for some of the most insidious production bugs we have debugged in SaaS backends. The problem is not the primitives themselves but the gap between how they look in documentation examples and how they behave under the load patterns that real SaaS products generate.
This is a practical guide to the concurrency patterns we use in production Go backends serving businesses in Lebanon and MENA, with specific attention to worker pools, channel design, and backpressure.
Why raw goroutines fail in production SaaS workloads
The entry point for most Go concurrency in production is something like this: an HTTP handler receives a request, decides that some work should happen asynchronously, and spawns a goroutine.
func (h *Handler) HandleWebhook(w http.ResponseWriter, r *http.Request) {
	payload := parseWebhook(r)
	go h.processWebhook(payload) // fire and forget
	w.WriteHeader(http.StatusAccepted)
}
This works in development. It works in early production. It fails at scale for several reasons.
First, there is no limit on goroutine creation. A burst of 10,000 webhook deliveries spawns 10,000 goroutines. Each goroutine needs stack space and typically a database connection. If the database connection pool has 20 connections and 10,000 goroutines are all trying to acquire one, you get 9,980 goroutines blocked waiting, holding memory, and not making progress. Under sustained load, goroutine count grows unboundedly.
Second, there is no backpressure. When the system is overloaded, it keeps accepting work rather than signaling to the caller that it should slow down or try later. The backlog of blocked goroutines grows, memory use climbs, and eventually the process is killed by the OOM killer or crashes.
Third, there is no visibility. You cannot easily observe how many goroutines are running, how many jobs are waiting, or whether jobs are being dropped. The system fails silently.
Fourth, there is no graceful shutdown. When the process receives SIGTERM and exits, in-flight goroutines are killed mid-execution. Multi-step database writes may be left partially applied, and any job that is not idempotent cannot be safely re-run afterwards.
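The first failure mode is easy to reproduce in isolation. The sketch below is an illustration under stated assumptions, not production code: a buffered channel stands in for a fixed-size database connection pool, and `simulateBurst` is a hypothetical helper that counts how many goroutines exist while a burst is in flight.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// simulateBurst spawns one goroutine per request. Every goroutine competes
// for a semaphore that stands in for a fixed-size database connection pool
// (a hypothetical substitute, not a real driver). It returns how many extra
// goroutines exist while the burst is in flight.
func simulateBurst(requests, poolSize int) int {
	sem := make(chan struct{}, poolSize)
	release := make(chan struct{})
	var started, finished sync.WaitGroup
	started.Add(requests)
	finished.Add(requests)

	before := runtime.NumGoroutine()
	for i := 0; i < requests; i++ {
		go func() {
			defer finished.Done()
			started.Done()
			sem <- struct{}{} // blocks once poolSize "connections" are taken
			<-release         // hold the connection until the burst is measured
			<-sem
		}()
	}
	started.Wait()
	inFlight := runtime.NumGoroutine() - before

	close(release) // let every goroutine finish so nothing leaks
	finished.Wait()
	return inFlight
}

func main() {
	fmt.Println("goroutines alive during burst:", simulateBurst(10000, 20))
}
```

The connection limit caps how much work makes progress, but it does nothing to cap how many goroutines exist. That gap is what a bounded pool closes.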
Building a bounded worker pool from scratch
A worker pool solves the goroutine explosion by creating a fixed number of goroutines at startup and routing all work through them.
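// WorkerPool runs a fixed number of workers that read from one bounded queue.
// Job is assumed to be a type with an Execute method.
type WorkerPool struct {
	jobs    chan Job
	wg      sync.WaitGroup
	workers int
}

func NewWorkerPool(workers int, queueDepth int) *WorkerPool {
	p := &WorkerPool{
		jobs:    make(chan Job, queueDepth),
		workers: workers,
	}
	return p
}

// Start launches the workers. Each one exits once the jobs channel is
// closed and drained.
func (p *WorkerPool) Start() {
	for i := 0; i < p.workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				job.Execute()
			}
		}()
	}
}

// Submit enqueues a job without blocking and reports whether it was accepted.
func (p *WorkerPool) Submit(job Job) bool {
	select {
	case p.jobs <- job:
		return true
	default:
		return false // queue full: reject the job
	}
}

// Shutdown stops intake and blocks until the queued jobs have run.
func (p *WorkerPool) Shutdown() {
	close(p.jobs)
	p.wg.Wait()
}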
A few design decisions worth explaining.
queueDepth is the buffered channel size. This determines how many jobs can be waiting before Submit starts rejecting. Set it based on your latency tolerance and memory budget. For webhook processing with a 10-second acceptable latency and 1ms per job, a single worker drains a queue of 10,000 jobs in 10 seconds, so that depth gives you 10 seconds of buffering; with N workers the same depth drains roughly N times faster. Each queued job costs the memory of its job struct, which should be small.
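The sizing rule is simple arithmetic, so it can live next to the pool configuration. This is a rough sketch: `queueDepthFor` is a hypothetical helper, and it assumes job duration is roughly constant and that all workers drain the queue in parallel.

```go
package main

import (
	"fmt"
	"time"
)

// queueDepthFor returns the largest queue depth for which a job entering a
// full queue still starts within maxWait, assuming every job takes roughly
// perJob and all workers drain the queue in parallel.
func queueDepthFor(maxWait, perJob time.Duration, workers int) int {
	if maxWait <= 0 || perJob <= 0 || workers <= 0 {
		return 0
	}
	return int(maxWait/perJob) * workers
}

func main() {
	// 10s tolerance, 1ms jobs, one worker.
	fmt.Println(queueDepthFor(10*time.Second, time.Millisecond, 1))
	// 10s tolerance, 50ms jobs, eight workers.
	fmt.Println(queueDepthFor(10*time.Second, 50*time.Millisecond, 8))
}
```

Real job durations vary, so treat the result as an upper bound to validate against measured latency rather than as a value to trust blindly.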
Submit uses a non-blocking select. This is the backpressure mechanism. When the queue is full, the method returns false instead of blocking the caller. The caller decides what to do: log a metric, push to a persistent queue like Redis, return HTTP 429, or retry with backoff.
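Shutdown closes the jobs channel and waits for all workers to drain it before returning. This is a clean shutdown: every job already accepted into the queue runs before the process exits. Because sending on a closed channel panics, all callers of Submit must have stopped before Shutdown is called.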
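To see the rejection path in action, the following self-contained sketch repeats the pool in compact form and fills the queue before any worker starts. The `Job` interface and `countJob` type are assumptions made for illustration; the article does not define them.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Job mirrors the interface the pool appears to assume (hypothetical here).
type Job interface{ Execute() error }

// countJob is a hypothetical job that only bumps a counter.
type countJob struct{ done *atomic.Int64 }

func (j countJob) Execute() error { j.done.Add(1); return nil }

type WorkerPool struct {
	jobs    chan Job
	wg      sync.WaitGroup
	workers int
}

func NewWorkerPool(workers, queueDepth int) *WorkerPool {
	return &WorkerPool{jobs: make(chan Job, queueDepth), workers: workers}
}

func (p *WorkerPool) Start() {
	for i := 0; i < p.workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				_ = job.Execute()
			}
		}()
	}
}

func (p *WorkerPool) Submit(job Job) bool {
	select {
	case p.jobs <- job:
		return true
	default:
		return false
	}
}

func (p *WorkerPool) Shutdown() {
	close(p.jobs)
	p.wg.Wait()
}

// demo fills a queue of depth 2 before any worker runs, so the third
// Submit is rejected. It then starts the worker and drains the queue.
func demo() (accepted []bool, executed int64) {
	var done atomic.Int64
	pool := NewWorkerPool(1, 2)
	for i := 0; i < 3; i++ {
		accepted = append(accepted, pool.Submit(countJob{done: &done}))
	}
	pool.Start()
	pool.Shutdown()
	return accepted, done.Load()
}

func main() {
	accepted, executed := demo()
	fmt.Println(accepted, executed)
}
```

The two accepted jobs run during Shutdown, and the rejected one is never executed. The caller has to handle it.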
Channel backpressure: how to reject work gracefully
Backpressure is the signal a system sends when it is at capacity. Without backpressure, the system accepts work it cannot process, which leads to queue buildup, latency increases, and eventual failure.
The non-blocking Submit above is one form of backpressure. The caller gets an immediate false and can act on it. For HTTP handlers, the right response to a full queue depends on the contract with the caller.
For webhook delivery from external services, return HTTP 429 with a Retry-After header. Well-behaved webhook senders will back off and retry.
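A handler for that case might look like the sketch below. The `submitter` interface, the `stubPool` type, the `func()` job shape, and the 5-second Retry-After value are all assumptions for illustration; pick a delay that matches how fast your queue actually drains.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// submitter is the one pool method the handler needs. The func() job type
// is a simplification for this sketch.
type submitter interface{ Submit(job func()) bool }

// stubPool is a hypothetical stand-in that is either always full or never full.
type stubPool struct{ full bool }

func (s stubPool) Submit(func()) bool { return !s.full }

// webhookHandler answers 202 when the job is queued and 429 with a
// Retry-After hint when the queue is full.
func webhookHandler(pool submitter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !pool.Submit(func() { /* process the webhook payload */ }) {
			w.Header().Set("Retry-After", "5") // seconds; an assumed value
			http.Error(w, "queue full, retry later", http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

// respond runs one request through the handler and reports the status
// code and Retry-After header it produced.
func respond(full bool) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/webhook", nil)
	webhookHandler(stubPool{full: full})(rec, req)
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	fmt.Println(respond(false))
	fmt.Println(respond(true))
}
```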
For internal job submission within the same process, push the job to a Redis list or a database queue table instead. This provides durability: if the process restarts, the job survives. The worker pool can drain from Redis on startup.
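One way to structure that fallback is to hide the persistent store behind a small interface. In this sketch `DurableQueue` is a hypothetical interface, not a real Redis client API, and `memoryQueue` exists only so the example runs without external services.

```go
package main

import (
	"errors"
	"fmt"
)

// DurableQueue is a hypothetical interface over whatever persistent store
// is used (a Redis list, a database table). It is not a real client API.
type DurableQueue interface {
	Push(payload []byte) error
}

// memoryQueue is an in-memory stand-in used only to make the sketch runnable.
type memoryQueue struct{ items [][]byte }

func (q *memoryQueue) Push(p []byte) error {
	q.items = append(q.items, p)
	return nil
}

var errNoCapacity = errors.New("no in-memory or durable capacity")

// submitOrPersist tries the in-memory queue first and spills to the
// durable queue when it is full.
func submitOrPersist(trySubmit func([]byte) bool, overflow DurableQueue, payload []byte) error {
	if trySubmit(payload) {
		return nil
	}
	if err := overflow.Push(payload); err != nil {
		return fmt.Errorf("%w: %v", errNoCapacity, err)
	}
	return nil
}

// spillCount reports how many of n payloads overflow a queue of the given capacity.
func spillCount(n, capacity int) int {
	ch := make(chan []byte, capacity)
	try := func(p []byte) bool {
		select {
		case ch <- p:
			return true
		default:
			return false
		}
	}
	overflow := &memoryQueue{}
	for i := 0; i < n; i++ {
		_ = submitOrPersist(try, overflow, []byte(fmt.Sprintf("job-%d", i)))
	}
	return len(overflow.items)
}

func main() {
	fmt.Println("spilled to durable queue:", spillCount(5, 3))
}
```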
For user-facing request handlers, consider whether the work truly needs to be async. If the user is waiting for a result and the queue is full, returning an error is better than making the user wait indefinitely.
Error handling and retry logic
Worker pool jobs fail. The network is unavailable. The database query returns a deadlock. The third-party API returns a 503. How the pool handles these failures determines whether the system is reliable.
The basic pattern is retry with exponential backoff and a maximum attempt count.
func executeWithRetry(job Job, maxAttempts int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := job.Execute()
		if err == nil {
			return
		}
		if !isRetryable(err) {
			logPermanentFailure(job, err)
			return
		}
		if attempt < maxAttempts {
			// Exponential backoff: 100ms, 200ms, 400ms, ...
			backoff := time.Duration(1<<(attempt-1)) * 100 * time.Millisecond
			time.Sleep(backoff)
		}
	}
	logExhaustedRetries(job)
}
Not all errors are retryable. A database unique constraint violation will not resolve itself with a retry. A 400 from an external API means the request is malformed. These should be logged as permanent failures and not retried.
Retryable errors are transient: network timeouts, 503s, database connection errors, deadlocks. These have a reasonable chance of succeeding on retry.
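The classification itself can be a single function. The version below is one possible shape, not a complete taxonomy: `statusError` and `errUniqueViolation` are hypothetical types invented for this sketch, and a real implementation would inspect the error codes exposed by your database driver and HTTP client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// statusError is a hypothetical error type carrying an upstream HTTP status.
type statusError struct{ code int }

func (e *statusError) Error() string { return fmt.Sprintf("upstream returned %d", e.code) }

// errUniqueViolation is a hypothetical sentinel; real drivers expose their own codes.
var errUniqueViolation = errors.New("unique constraint violation")

// Sample errors used by main.
var (
	errTimeout   = fmt.Errorf("query users: %w", context.DeadlineExceeded)
	errDuplicate = fmt.Errorf("insert user: %w", errUniqueViolation)
	errBusy      = &statusError{code: 503}
	errBadInput  = &statusError{code: 400}
)

// isRetryable reports whether an error is likely transient.
// Unknown errors are treated as permanent so they surface quickly.
func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, errUniqueViolation) {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	var se *statusError
	if errors.As(err, &se) {
		return se.code == 502 || se.code == 503 || se.code == 504
	}
	return false
}

func main() {
	fmt.Println("timeout:", isRetryable(errTimeout))
	fmt.Println("duplicate:", isRetryable(errDuplicate))
	fmt.Println("503:", isRetryable(errBusy))
	fmt.Println("400:", isRetryable(errBadInput))
}
```

Defaulting unknown errors to permanent is a design choice. The opposite default retries more aggressively but risks hammering a dependency with requests that can never succeed.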
For jobs that must never be lost even after exhausting retries, write the failed job to a dead letter table in the database with the error, the attempt count, and the original payload. A human or automated process can review and requeue dead letter jobs.
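A dead letter record needs little more than the three fields named above. In the sketch below, `DeadLetterStore` is a hypothetical interface and `memoryStore` is an in-memory stand-in; a real store would insert a row into a database table. Backoff between attempts is omitted to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// DeadLetter holds the original payload, the last error, and the attempt count.
type DeadLetter struct {
	Payload  []byte
	LastErr  string
	Attempts int
}

// DeadLetterStore is a hypothetical interface; a real implementation would
// insert a row into a database table.
type DeadLetterStore interface {
	Save(DeadLetter) error
}

// memoryStore is an in-memory stand-in used only to make the sketch runnable.
type memoryStore struct{ rows []DeadLetter }

func (s *memoryStore) Save(d DeadLetter) error {
	s.rows = append(s.rows, d)
	return nil
}

// runOrDeadLetter tries the job up to maxAttempts times (backoff omitted
// for brevity) and records it in the store if every attempt fails.
func runOrDeadLetter(payload []byte, run func([]byte) error, maxAttempts int, store DeadLetterStore) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = run(payload); lastErr == nil {
			return nil
		}
	}
	return store.Save(DeadLetter{Payload: payload, LastErr: lastErr.Error(), Attempts: maxAttempts})
}

// deadLetterDemo runs a job that always fails and reports what was stored.
func deadLetterDemo() (rows, attempts int) {
	store := &memoryStore{}
	alwaysFail := func([]byte) error { return errors.New("upstream 503") }
	_ = runOrDeadLetter([]byte(`{"id":1}`), alwaysFail, 3, store)
	if len(store.rows) == 0 {
		return 0, 0
	}
	return len(store.rows), store.rows[0].Attempts
}

func main() {
	fmt.Println(deadLetterDemo())
}
```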
Monitoring pool health in production
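A worker pool you cannot observe is a risk. The minimum instrumentation for a production worker pool is three metrics: queue depth, active worker count, and job duration. The struct below covers the first two with atomic counters and adds running totals of processed and failed jobs; job duration is usually recorded by timing each Execute call.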
type WorkerPool struct {
	jobs        chan Job
	wg          sync.WaitGroup
	workers     int
	activeCount atomic.Int64
	processed   atomic.Int64
	failed      atomic.Int64
}

func (p *WorkerPool) QueueDepth() int {
	return len(p.jobs)
}

func (p *WorkerPool) ActiveWorkers() int64 {
	return p.activeCount.Load()
}

func (p *WorkerPool) ProcessedTotal() int64 {
	return p.processed.Load()
}
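The accessors above only read the counters; something still has to update them around each job. One possible way to do that, sketched under assumptions, is a small wrapper called from the worker loop. The `poolMetrics` type, its `totalNanos` field, and the `observe` method are hypothetical additions, not part of the pool shown in this article.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// poolMetrics groups the counters. totalNanos is an addition for this
// sketch so an average job duration can be derived.
type poolMetrics struct {
	active     atomic.Int64
	processed  atomic.Int64
	failed     atomic.Int64
	totalNanos atomic.Int64
}

// observe wraps one job execution and updates every counter around it.
func (m *poolMetrics) observe(execute func() error) {
	m.active.Add(1)
	start := time.Now()
	err := execute()
	m.totalNanos.Add(int64(time.Since(start)))
	m.active.Add(-1)
	m.processed.Add(1)
	if err != nil {
		m.failed.Add(1)
	}
}

// avgDuration returns the mean job duration, or zero before any job ran.
func (m *poolMetrics) avgDuration() time.Duration {
	n := m.processed.Load()
	if n == 0 {
		return 0
	}
	return time.Duration(m.totalNanos.Load() / n)
}

// metricsDemo runs two successful jobs and one failing job.
func metricsDemo() (processed, failed, active int64) {
	var m poolMetrics
	ok := func() error { return nil }
	bad := func() error { return errors.New("boom") }
	m.observe(ok)
	m.observe(bad)
	m.observe(ok)
	return m.processed.Load(), m.failed.Load(), m.active.Load()
}

func main() {
	fmt.Println(metricsDemo())
}
```

An average hides tail latency, so in practice a histogram from your metrics library is usually a better fit for job duration than a single running total.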
Expose these values via a /metrics endpoint or push them to Prometheus or your observability platform of choice. Alert when queue depth exceeds 80% of capacity. Alert when the active worker count stays equal to the total worker count (saturation).
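If you want something working before wiring up a full metrics stack, the standard library's expvar package is enough for a first pass. This is a minimal sketch with assumed names (`publishQueueDepth`, `queueNearlyFull`, and the variable name `webhook_pool_queue_depth` are illustrative), and it swaps in expvar's JSON output in place of a Prometheus-format /metrics endpoint.

```go
package main

import (
	"expvar"
	"fmt"
)

// publishQueueDepth registers a live gauge under the given name. Importing
// expvar also registers a JSON handler at /debug/vars on http.DefaultServeMux.
// Publish panics if the name is already registered, so call this once per pool.
func publishQueueDepth(name string, depth func() int) {
	expvar.Publish(name, expvar.Func(func() any { return depth() }))
}

// queueNearlyFull reports whether depth exceeds 80% of capacity.
func queueNearlyFull(depth, capacity int) bool {
	return depth*5 > capacity*4
}

// publishedValue reads a published variable back as its JSON text.
func publishedValue(name string) string {
	v := expvar.Get(name)
	if v == nil {
		return ""
	}
	return v.String()
}

func main() {
	// A buffered channel stands in for the pool's job queue.
	jobs := make(chan int, 10)
	for i := 0; i < 9; i++ {
		jobs <- i
	}
	publishQueueDepth("webhook_pool_queue_depth", func() int { return len(jobs) })
	fmt.Println(publishedValue("webhook_pool_queue_depth"), queueNearlyFull(len(jobs), cap(jobs)))
}
```

Because the gauge is a function, every scrape reads the current channel length, so nothing needs to update it on the hot path.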
For SaaS products serving businesses in Lebanon and MENA, having a Grafana dashboard that shows queue depth over time for each worker pool type is extremely useful during incidents. You can see exactly when the queue started building and correlate with external events like a traffic spike or a slow database query.
Coordinating shutdown cleanly
Graceful shutdown of a worker pool-based system requires coordination between the HTTP server, the pool, and the pool's workers.
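// Needs: context, errors, log, net/http, os, os/signal, syscall, time.
func main() {
	pool := NewWorkerPool(10, 1000)
	pool.Start()

	server := &http.Server{
		Addr:    ":8080",
		Handler: NewRouter(pool),
	}

	// Serve in the background so main can wait for a signal.
	go func() {
		if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("http server: %v", err)
		}
	}()

	// Wait for termination signal
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
	<-stop

	// Stop accepting new HTTP requests
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		log.Printf("http shutdown: %v", err)
	}

	// Drain the pool queue
	pool.Shutdown()
}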
The shutdown sequence matters. Stop the HTTP server first so no new jobs are submitted. Then drain the pool. As long as the server shuts down within its timeout, every accepted job is processed before the process exits.
The 30-second timeout on server shutdown prevents the process from hanging indefinitely on slow in-flight requests. It does not bound the pool drain: pool.Shutdown waits for as long as the queued jobs take, so budget for your maximum expected job execution time as well.
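If the pool drain needs its own upper bound, a WaitGroup can be waited on with a deadline. The helper below is a sketch, not part of the pool shown earlier: `waitTimeout` and `drainDemo` are hypothetical names, and a simulated worker stands in for real jobs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitTimeout waits for wg and reports whether it finished before d elapsed.
// On timeout the helper goroutine stays blocked until wg completes, which is
// acceptable when the process is about to exit anyway.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// drainDemo starts one simulated worker that runs for workFor and waits
// at most limit for it to finish.
func drainDemo(workFor, limit time.Duration) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(workFor)
	}()
	return waitTimeout(&wg, limit)
}

func main() {
	fmt.Println("fast drain finished in time:", drainDemo(10*time.Millisecond, time.Second))
	fmt.Println("slow drain finished in time:", drainDemo(time.Second, 10*time.Millisecond))
}
```

When the wait times out, jobs still in the queue have not run, which is exactly the case a durable queue or dead letter table is meant to cover.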
Key lessons from production
Never use unbounded goroutine spawning for production job processing. Always use a bounded worker pool.
Design backpressure before you need it. The Submit rejection path is as important as the happy path.
Error classification matters. Retryable errors should retry. Permanent errors should fail fast and alert.
A dead letter system prevents silent job loss. Every failed-to-process job should be recoverable.
Expose queue depth and worker saturation as metrics. You cannot optimize what you cannot observe.
Graceful shutdown is a first-class requirement. Partial database writes from killed goroutines create inconsistency bugs that are extremely difficult to debug.
Not sure where to start?
Voxire designs and builds production Go backends for SaaS companies in Lebanon and the MENA region. If your backend needs reliable async job processing or you are dealing with concurrency issues in production, we can help.
https://voxire.com/get-a-quote/



