Chapter 8: Analyzing Results and Bottleneck Identification
In most web applications, the database is the single biggest bottleneck. Ask any experienced performance engineer and they will tell you: "It is almost always the database." Why? Because the database is the only component that cannot be horizontally scaled as easily as application servers. You can spin up 10 more app server instances in minutes, but splitting a database requires careful planning, schema changes, and data migration. When your app server CPU is at 30% but response times are through the roof, the problem is almost certainly downstream -- and the database is suspect number one.
A query that takes 10ms with 10 concurrent users might take 2 seconds with 200 users. Why? Because database queries compete for the same resources: CPU for computation, memory for caching, disk for reads, and locks for data consistency. A query that is "fast enough" in development becomes a bottleneck under load because it was never tested with concurrent access.
-- Enable slow query logging (in postgresql.conf)
-- log_min_duration_statement = 500 -- Log queries taking > 500ms
-- Find the slowest queries during the test window
SELECT query,
calls,
mean_exec_time::numeric(10,2) AS avg_ms,
max_exec_time::numeric(10,2) AS max_ms,
total_exec_time::numeric(10,2) AS total_ms,
rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Find queries with the most total time (cumulative impact)
SELECT query,
calls,
total_exec_time::numeric(10,2) AS total_ms,
(total_exec_time / calls)::numeric(10,2) AS avg_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Check for missing indexes
SELECT relname AS table,
seq_scan,
idx_scan,
seq_scan - idx_scan AS seq_over_idx
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_over_idx DESC;
-- If seq_scan >> idx_scan, the table is being full-scanned. Add an index.-- Enable slow query log
-- SET GLOBAL slow_query_log = ON;
-- SET GLOBAL long_query_time = 0.5; -- 500ms threshold
-- Find running queries right now (during load test)
SHOW FULL PROCESSLIST;
-- Look for queries with long Time values and State = "Sending data" or "Sorting"
-- After the test, analyze the slow query log:
-- mysqldumpslow -s t /var/log/mysql/slow.log
-- This groups similar queries and sorts by total time
-- Check for table locks (MyISAM) or row locks (InnoDB)
SHOW ENGINE INNODB STATUS\G
-- Look for "LATEST DETECTED DEADLOCK" and "TRANSACTIONS" sections
-- Check InnoDB buffer pool hit rate
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- If Innodb_buffer_pool_reads is high relative to
-- Innodb_buffer_pool_read_requests, your buffer pool is too small
-- (data is being read from disk instead of memory)Every application maintains a pool of database connections. Creating a new database connection is expensive (TCP handshake, authentication, SSL negotiation), so applications reuse connections from a pool. Under normal load, the pool has plenty of idle connections. Under heavy load, all connections are busy, and new requests have to wait. If the wait exceeds a timeout, you get connection pool exhaustion errors.
| Pool Symptom | What You See in Load Test | Root Cause | Fix |
|---|---|---|---|
| Response times suddenly spike at a specific user count | All endpoints slow simultaneously | Pool ran out of connections, requests queuing | Increase pool size (but check DB can handle it) |
| Intermittent connection timeout errors | Sporadic 500s across all endpoints | Pool too small for burst traffic | Increase pool size, add connection timeout logging |
| Everything slow after long test (soak) | Gradual degradation over hours | Connection leak -- connections borrowed but never returned | Fix code to close connections in finally blocks, enable leak detection |
| DB CPU low but app response times high | DB metrics look fine, app is slow | App waiting for pool, not for DB | Check pool utilization metrics, increase pool size |
# HikariCP (most popular Java connection pool) configuration
# These are the critical settings for performance:
maximumPoolSize: 20 # Max connections in the pool
minimumIdle: 5 # Keep at least 5 connections ready
connectionTimeout: 30000 # Wait 30s for a connection (then fail)
idleTimeout: 600000 # Close idle connections after 10 minutes
maxLifetime: 1800000 # Recycle connections every 30 minutes
leakDetectionThreshold: 60000 # Log warning if connection not returned in 60s
# Formula for pool size (PostgreSQL recommendation):
# connections = ((core_count * 2) + effective_spindle_count)
# For a 4-core server with SSD: (4 * 2) + 1 = 9 connections
# Yes, that is surprisingly small! More connections != more performance.
# Too many connections cause context-switching overhead in the database.
# WARNING: Total connections across all app instances must not exceed
# the database max_connections setting.
# 5 app servers x 20 pool size = 100 connections
# PostgreSQL default max_connections = 100 <-- you will hit the limit!When multiple requests try to modify the same rows simultaneously, the database uses locks to maintain data consistency. Under low load, lock waits are imperceptible. Under high load, they stack up. Imagine a single checkout counter at a store -- with 5 customers, the wait is fine. With 500 customers, the line wraps around the building. If you see a specific transaction type (like "checkout" or "transfer") that is slow while read-only operations (like "view product") are fast, lock contention is the prime suspect.
Network bottlenecks are less common than database bottlenecks but can be tricky to diagnose because they look similar from the client side -- slow response times. The key differentiator is that network bottlenecks affect all endpoints equally, while application bottlenecks affect specific endpoints.
# Monitor bandwidth usage in real-time
iftop -i eth0 # Interactive bandwidth monitor
nload eth0 # Simpler, shows in/out bandwidth
# Check open connections and their states
ss -s # Summary of socket statistics
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Look for many TIME_WAIT connections -- indicates connection churn
# Check for packet drops (congestion indicator)
netstat -s | grep -i drop
netstat -s | grep -i error
# Check latency to the database server
ping -c 10 db-server-ip
# If average > 1ms for same-region, something is wrong
# Check DNS resolution time
dig +stats api.example.com | grep "Query time"
# Should be < 5ms with caching. > 50ms = DNS problem.
# Monitor TCP retransmissions (network congestion)
netstat -s | grep retransmitQ: Your load test shows that response times increase linearly with users, but the application server CPU is only at 40%. Where is the bottleneck?
A: If the app server CPU is low but response times increase with load, the bottleneck is downstream -- most likely the database. I would check several things in order: First, database CPU and query performance. If the DB CPU is high, slow queries are likely the cause. I would check the slow query log and pg_stat_statements. Second, connection pool utilization. If the pool is fully utilized, requests are queuing for connections even though the app server itself has spare CPU. Third, lock contention in the database -- if many users are updating the same rows, they serialize on locks. Fourth, network latency between the app server and database. Finally, I would check if there is an external API call in the request path that is slow. The key insight is: low app server CPU with high response times almost always means the app is waiting for something -- a database, an external service, a lock, or a connection.
Key Point: The database is the most common bottleneck in web applications. Look for slow queries, connection pool exhaustion, and lock contention. If app server CPU is low but response times are high, the app is waiting for something downstream.
Key Point: Database bottlenecks (slow queries, connection pool exhaustion, lock contention) are the most common root cause. Low app server CPU + high response times = look downstream.