Identifying Database and Network Bottlenecks

In most web applications, the database is the single biggest bottleneck. Ask any experienced performance engineer and they will tell you: "It is almost always the database." Why? Because the database is the only component that cannot be horizontally scaled as easily as application servers. You can spin up 10 more app server instances in minutes, but splitting a database requires careful planning, schema changes, and data migration. When your app server CPU is at 30% but response times are through the roof, the problem is almost certainly downstream -- and the database is suspect number one.

Slow Queries -- The Usual Suspect

A query that takes 10ms with 10 concurrent users might take 2 seconds with 200 users. Why? Because database queries compete for the same resources: CPU for computation, memory for caching, disk for reads, and locks for data consistency. A query that is "fast enough" in development becomes a bottleneck under load because it was never tested with concurrent access.

Finding Slow Queries (PostgreSQL)sql

-- Enable slow query logging (in postgresql.conf)
-- log_min_duration_statement = 500  -- Log queries taking > 500ms

-- Find the slowest queries during the test window
SELECT query,
       calls,
       mean_exec_time::numeric(10,2) AS avg_ms,
       max_exec_time::numeric(10,2) AS max_ms,
       total_exec_time::numeric(10,2) AS total_ms,
       rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Find queries with the most total time (cumulative impact)
SELECT query,
       calls,
       total_exec_time::numeric(10,2) AS total_ms,
       (total_exec_time / calls)::numeric(10,2) AS avg_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Check for missing indexes
SELECT relname AS table,
       seq_scan,
       idx_scan,
       seq_scan - idx_scan AS seq_over_idx
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_over_idx DESC;
-- If seq_scan >> idx_scan, the table is being full-scanned. Add an index.

Finding Slow Queries (MySQL)sql

-- Enable slow query log
-- SET GLOBAL slow_query_log = ON;
-- SET GLOBAL long_query_time = 0.5;  -- 500ms threshold

-- Find running queries right now (during load test)
SHOW FULL PROCESSLIST;
-- Look for queries with long Time values and State = "Sending data" or "Sorting"

-- After the test, analyze the slow query log:
-- mysqldumpslow -s t /var/log/mysql/slow.log
-- This groups similar queries and sorts by total time

-- Check for table locks (MyISAM) or row locks (InnoDB)
SHOW ENGINE INNODB STATUS\G
-- Look for "LATEST DETECTED DEADLOCK" and "TRANSACTIONS" sections

-- Check InnoDB buffer pool hit rate
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- If Innodb_buffer_pool_reads is high relative to
-- Innodb_buffer_pool_read_requests, your buffer pool is too small
-- (data is being read from disk instead of memory)

Connection Pool Exhaustion

Every application maintains a pool of database connections. Creating a new database connection is expensive (TCP handshake, authentication, SSL negotiation), so applications reuse connections from a pool. Under normal load, the pool has plenty of idle connections. Under heavy load, all connections are busy, and new requests have to wait. If the wait exceeds a timeout, you get connection pool exhaustion errors.

Pool Symptom	What You See in Load Test	Root Cause	Fix
Response times suddenly spike at a specific user count	All endpoints slow simultaneously	Pool ran out of connections, requests queuing	Increase pool size (but check DB can handle it)
Intermittent connection timeout errors	Sporadic 500s across all endpoints	Pool too small for burst traffic	Increase pool size, add connection timeout logging
Everything slow after long test (soak)	Gradual degradation over hours	Connection leak -- connections borrowed but never returned	Fix code to close connections in finally blocks, enable leak detection
DB CPU low but app response times high	DB metrics look fine, app is slow	App waiting for pool, not for DB	Check pool utilization metrics, increase pool size

Connection Pool Sizingtext

# HikariCP (most popular Java connection pool) configuration
# These are the critical settings for performance:

maximumPoolSize: 20       # Max connections in the pool
minimumIdle: 5            # Keep at least 5 connections ready
connectionTimeout: 30000  # Wait 30s for a connection (then fail)
idleTimeout: 600000       # Close idle connections after 10 minutes
maxLifetime: 1800000      # Recycle connections every 30 minutes
leakDetectionThreshold: 60000  # Log warning if connection not returned in 60s

# Formula for pool size (PostgreSQL recommendation):
# connections = ((core_count * 2) + effective_spindle_count)
# For a 4-core server with SSD: (4 * 2) + 1 = 9 connections
# Yes, that is surprisingly small! More connections != more performance.
# Too many connections cause context-switching overhead in the database.

# WARNING: Total connections across all app instances must not exceed
# the database max_connections setting.
# 5 app servers x 20 pool size = 100 connections
# PostgreSQL default max_connections = 100 <-- you will hit the limit!

Lock Contention

When multiple requests try to modify the same rows simultaneously, the database uses locks to maintain data consistency. Under low load, lock waits are imperceptible. Under high load, they stack up. Imagine a single checkout counter at a store -- with 5 customers, the wait is fine. With 500 customers, the line wraps around the building. If you see a specific transaction type (like "checkout" or "transfer") that is slow while read-only operations (like "view product") are fast, lock contention is the prime suspect.

Network Bottlenecks

Network bottlenecks are less common than database bottlenecks but can be tricky to diagnose because they look similar from the client side -- slow response times. The key differentiator is that network bottlenecks affect all endpoints equally, while application bottlenecks affect specific endpoints.

Bandwidth saturation -- The pipe is full. If your server has a 1 Gbps link and you are transferring large payloads, you can saturate it. Monitor with iftop or nload. Common with image-heavy applications or large API responses.
High latency between components -- If the app server and database are in different data centers or availability zones, each query adds network round-trip time. A query that takes 1ms on the DB might take 10ms door-to-door. Multiply by 20 queries per request and you have 200ms of pure network overhead.
DNS resolution delays -- If every request triggers a DNS lookup, latency adds up. Check if DNS caching is enabled. Usually manifests as inconsistent response times (cache hit vs. miss).
TCP connection exhaustion -- Similar to connection pool issues but at the OS level. The server runs out of ephemeral ports or file descriptors. Check with "ss -s" or "netstat -an | wc -l".
SSL/TLS handshake overhead -- Each new HTTPS connection requires a TLS handshake (CPU-intensive). Under load, this can be significant. Solution: enable HTTP keep-alive and connection reuse.

Network Monitoring Commandsbash

# Monitor bandwidth usage in real-time
iftop -i eth0        # Interactive bandwidth monitor
nload eth0           # Simpler, shows in/out bandwidth

# Check open connections and their states
ss -s                # Summary of socket statistics
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Look for many TIME_WAIT connections -- indicates connection churn

# Check for packet drops (congestion indicator)
netstat -s | grep -i drop
netstat -s | grep -i error

# Check latency to the database server
ping -c 10 db-server-ip
# If average > 1ms for same-region, something is wrong

# Check DNS resolution time
dig +stats api.example.com | grep "Query time"
# Should be < 5ms with caching. > 50ms = DNS problem.

# Monitor TCP retransmissions (network congestion)
netstat -s | grep retransmit

Q: Your load test shows that response times increase linearly with users, but the application server CPU is only at 40%. Where is the bottleneck?

A: If the app server CPU is low but response times increase with load, the bottleneck is downstream -- most likely the database. I would check several things in order: First, database CPU and query performance. If the DB CPU is high, slow queries are likely the cause. I would check the slow query log and pg_stat_statements. Second, connection pool utilization. If the pool is fully utilized, requests are queuing for connections even though the app server itself has spare CPU. Third, lock contention in the database -- if many users are updating the same rows, they serialize on locks. Fourth, network latency between the app server and database. Finally, I would check if there is an external API call in the request path that is slow. The key insight is: low app server CPU with high response times almost always means the app is waiting for something -- a database, an external service, a lock, or a connection.

Key Point: The database is the most common bottleneck in web applications. Look for slow queries, connection pool exhaustion, and lock contention. If app server CPU is low but response times are high, the app is waiting for something downstream.

Key Point: Database bottlenecks (slow queries, connection pool exhaustion, lock contention) are the most common root cause. Low app server CPU + high response times = look downstream.

Previous Up NextAPM Tools and Profiling Under Load

Chapter 8: Analyzing Results and Bottleneck Identification