PostgreSQL Query Optimization in Multi-Tenant SaaS Applications

PostgreSQL query performance issues in multi-tenant SaaS products follow predictable patterns. The same five problems appear at scale across most products: missing composite indexes, poor connection management, sequential scans on large tables, N+1 query chains, and unmonitored slow queries. Here is how to diagnose and fix each one.

Why do PostgreSQL performance problems appear later in a SaaS lifecycle?

A SaaS product at 10 tenants with 1,000 rows per table has no perceptible database performance issues. At 100 tenants with 100,000 rows per table, slow queries start appearing in logs. At 1,000 tenants, some queries that were fine in development become production outages.

The reason is that query plans change as table sizes grow. PostgreSQL's query planner makes decisions based on table statistics. A sequential scan that costs 2ms on a 10,000-row table costs 200ms on a 1,000,000-row table. The query did not change. The data size changed, and the planner's decision became wrong.

This dynamic is predictable. The right response is building observability before the problems appear, not debugging in production after users are affected.

The composite index problem

In a multi-tenant SaaS database, almost every meaningful query filters on tenant_id plus one or two additional columns. A query like:

SELECT * FROM orders
WHERE tenant_id = $1 AND status = 'pending'
ORDER BY created_at DESC;

This query needs a composite index on (tenant_id, status, created_at). An index on tenant_id alone is not sufficient - PostgreSQL will use it to filter to the tenant's rows and then scan all of them to find pending orders. An index on status alone is also not sufficient.

The composite index should lead with the most selective column that appears in an equality condition. tenant_id and status are both equality conditions here. created_at is used for ordering, so it goes last in the index.

Run EXPLAIN (ANALYZE, BUFFERS) on your slower queries to see the actual execution plan. Any plan that shows a Seq Scan on a large table is a candidate for an index. Any plan that shows a high Rows Removed by Filter count relative to the rows actually returned is an index problem.

Connection exhaustion and PgBouncer configuration

PostgreSQL allocates memory per connection. A PostgreSQL instance with max_connections = 100 starts rejecting new connections when 100 are in use, regardless of whether those connections are actively doing work.

In a multi-tenant SaaS Go backend, the typical failure mode is an application that opens a connection per goroutine during high traffic. Without connection pooling, this can exhaust the database connection limit during peak load even though individual queries are fast.

PgBouncer in transaction pooling mode is the standard solution. It maintains a small pool of actual PostgreSQL connections and multiplexes many application connections across them. A PgBouncer configuration with pool_size = 20 per database allows hundreds of application goroutines to share 20 actual PostgreSQL connections.

The max_client_conn setting in PgBouncer controls how many application connections it accepts. Set this to match the maximum expected concurrent application connections, not to PostgreSQL's max_connections. They are independent settings.

Monitor PgBouncer's wait queue. If applications are queuing for connections, either increase pool_size or reduce query duration so connections return to the pool faster.

Identifying slow queries with pg_stat_statements

pg_stat_statements is a PostgreSQL extension that tracks execution statistics for every distinct query. It is the first tool to enable on any production SaaS database.

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

To find the top queries by total execution time:

SELECT query,
       calls,
       mean_exec_time,
       total_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

This query surfaces the queries that are costing the most cumulative time. High mean_exec_time with high calls is the dangerous combination. A query that takes 100ms and runs 10,000 times per hour is adding 1,000 seconds of database load per hour.

For multi-tenant SaaS products, add per-tenant query tagging. PostgreSQL pg_stat_statements groups queries that differ only in literal values, so WHERE tenant_id = 1 and WHERE tenant_id = 2 appear as the same query. Tagging with /* tenant_id=xyz */ comments allows you to filter by tenant in the statistics when investigating tenant-specific issues.

N+1 query chains in Go ORM and query layers

N+1 query chains occur when a query returns N rows and the application then makes one additional query per row to fetch related data. The result is N+1 database round trips for what should be a single query with a join.

In Go backends using sqlc or hand-written SQL, N+1 chains are visible in the code. A loop that calls a database function inside it is a candidate for N+1. The fix is a single query with a JOIN or IN clause that fetches all related data at once.

In Go backends using ORMs, N+1 chains can be invisible. The ORM generates the loop queries automatically from relationship traversal. Tools like go-pg and ent have explicit eager loading mechanisms that must be used intentionally.

In pg_stat_statements, an N+1 chain appears as a single query with very high calls relative to the calling query. If a query appears 10,000 times in an hour and the endpoint it comes from receives 1,000 requests, there is a 10x multiplier that indicates an N+1 chain.

VACUUM, AUTOVACUUM, and table bloat

PostgreSQL uses multiversion concurrency control (MVCC). When a row is updated or deleted, the old version is not immediately removed. Dead row versions accumulate in the table and must be cleaned up by VACUUM.

AUTOVACUUM handles this automatically. The problem in multi-tenant SaaS databases is that tables with high update rates (orders, sessions, event logs) can outpace autovacuum's default settings, leading to table bloat that degrades query performance.

Monitor table bloat with:

SELECT schemaname, tablename,
       n_dead_tup, n_live_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

Tables with a dead_pct above 10-20% in a production system are candidates for autovacuum tuning. Adjust autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold at the table level for high-churn tables.

Partitioning for large tenant-specific tables

For multi-tenant SaaS tables that grow into the hundreds of millions of rows, table partitioning by tenant_id or by time range becomes worth the complexity.

Partitioning by time range is common for event logs and audit tables. A monthly partition for a table that receives 10 million events per month means queries filtered by date range only scan the relevant partition, not the full table history.

Partitioning by tenant_id (hash partitioning) is appropriate when tenants have very different data sizes and a small number of large tenants dominate query load. Hash partitioning distributes rows across a fixed number of partitions based on the hash of tenant_id.

Both approaches require planning. Existing tables must be migrated to partitioned tables, which requires downtime or careful online migration procedures. Design for partitioning before the table is too large to migrate.

Key lessons from production

Enable pg_stat_statements on day one. Waiting until performance problems appear means debugging without visibility.

Composite indexes on (tenant_id, filter_column, sort_column) are necessary in multi-tenant tables. A single-column index on tenant_id alone is insufficient for most queries.

PgBouncer in transaction mode is not optional at scale. Plan connection pool sizing for peak concurrent load, not average load.

N+1 query chains are the most common avoidable performance issue in Go SaaS backends. Review any database call inside a loop.

AUTOVACUUM settings need tuning for high-churn tables. Monitor dead row percentage on tables that have frequent updates or deletes.

Why do PostgreSQL performance problems appear later in a SaaS lifecycle?

The composite index problem

Connection exhaustion and PgBouncer configuration

Identifying slow queries with pg_stat_statements

N+1 query chains in Go ORM and query layers

VACUUM, AUTOVACUUM, and table bloat

Partitioning for large tenant-specific tables

Key lessons from production

Not sure where to start?

Keep reading

Feature Flags in Production Go SaaS: Deploying Features Without Restarts

Production Audit Logs in Go SaaS: Design, Storage, and What Most Teams Get Wrong

Go API Testing in Production SaaS: The Patterns That Actually Catch Real Bugs