Response time tells you the symptom. Resource utilization tells you the cause. When response times spike, is the CPU at 100%? Is the database out of connections? Is memory full? Without server-side metrics, you are guessing.
| Metric | Healthy | Warning | Critical | What to Do |
|---|---|---|---|---|
| CPU Usage | < 60% | 60-80% | > 80% | Optimize code, add servers, increase cores |
| Memory Usage | < 70% | 70-85% | > 85% | Fix memory leaks, increase RAM, optimize caching |
| Disk I/O | < 60% | 60-80% | > 80% | Optimize queries, add SSD, reduce logging |
| Network I/O | < 60% | 60-80% | > 80% | CDN, compression, reduce payload size |
| DB Connections | < 70% of pool | 70-90% | > 90% | Fix connection leaks, increase pool size |
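The thresholds in the table can be encoded directly so that a test report flags each metric automatically. The following is a minimal sketch, assuming every metric is reported as a percentage; the metric keys and the `classify` function are illustrative names, not part of any monitoring tool.

```python
from typing import Dict, Tuple

# (warning_threshold, critical_threshold) in percent, copied from the table above.
# The dictionary keys are hypothetical names chosen for this sketch.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu": (60.0, 80.0),
    "memory": (70.0, 85.0),
    "disk_io": (60.0, 80.0),
    "network_io": (60.0, 80.0),
    "db_connections": (70.0, 90.0),  # percent of the connection pool in use
}


def classify(metric: str, percent: float) -> str:
    """Return 'healthy', 'warning', or 'critical' for one metric reading."""
    warning, critical = THRESHOLDS[metric]
    if percent > critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "healthy"
```

For example, `classify("cpu", 72)` falls in the 60-80% band and is reported as a warning.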
The real power is correlating server metrics with test metrics. When response time jumps at 300 users, which server metric jumped at the same moment? If CPU hit 95% at 300 users, the CPU is the likely bottleneck. If CPU is fine but database connections hit 100%, the database connection pool is the likely bottleneck. Without this correlation, you risk fixing the wrong thing.
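One simple way to automate this correlation is to find the load step where response time rose the most and then see which server metric rose the most over the same step. The sketch below assumes one sample per load step, with all server metrics expressed in percent so their changes are comparable; the field names and `find_jump_suspect` are hypothetical.

```python
from typing import Dict, List


def find_jump_suspect(samples: List[Dict[str, float]],
                      metrics: List[str]) -> Dict[str, object]:
    """Find the step with the largest response-time jump and the metric that rose most with it."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")

    # Index of the sample whose response time increased most versus the previous sample.
    jump = max(
        range(1, len(samples)),
        key=lambda i: samples[i]["response_ms"] - samples[i - 1]["response_ms"],
    )
    before, after = samples[jump - 1], samples[jump]

    # Change in each server metric (percentage points) across that same step.
    deltas = {name: after[name] - before[name] for name in metrics}
    suspect = max(deltas, key=lambda name: deltas[name])
    return {"users": after["users"], "suspect": suspect, "delta": deltas[suspect]}


# Illustrative data only, not measurements from a real test.
example_samples = [
    {"users": 100, "response_ms": 210, "cpu": 35, "memory": 52, "db_connections": 30},
    {"users": 200, "response_ms": 260, "cpu": 58, "memory": 55, "db_connections": 45},
    {"users": 300, "response_ms": 1900, "cpu": 95, "memory": 57, "db_connections": 60},
]
```

Calling `find_jump_suspect(example_samples, ["cpu", "memory", "db_connections"])` points at the 300-user step and at CPU. The largest delta is only a starting point for investigation: a metric that was already near its limit may rise very little and still be the constraint.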
CPU at 95%, everything else normal → application code is the bottleneck. Optimize algorithms, add caching, or add CPU cores.
Memory growing steadily → likely a memory leak (or an unbounded cache). Profile the application to find the source. Run a soak test to confirm that memory never levels off.
DB connections at max, CPU low → connection pool exhaustion. Either connections are not being returned to the pool or the pool is too small for the load. Fix the leaking code first, then increase the pool size if needed.
Disk I/O at 100%, CPU normal → too many disk operations. Move to SSD, reduce logging verbosity, optimize database queries.
Network saturated → responses are too large. Enable compression (gzip), use a CDN, paginate API responses.
All resources low but response time high → the application is probably waiting on something outside this server. A downstream API call, a query on a separate database server, or a third-party service is the likely bottleneck.
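The patterns above can be written as a small rule set. This is a rough sketch, not a diagnostic tool: the cut-offs for "high" (90%) and "normal" (below 60%) are assumptions chosen for illustration, and the `diagnose` function and its keys are hypothetical.

```python
from typing import Dict, List, Sequence

RESOURCES = ("cpu", "memory", "disk_io", "network_io", "db_connections")


def diagnose(usage: Dict[str, float],
             response_slow: bool,
             memory_history: Sequence[float] = (),
             high: float = 90.0,
             normal: float = 60.0) -> List[str]:
    """Map resource readings (percent) to the likely bottlenecks listed above."""
    findings: List[str] = []

    # CPU saturated while every other resource is in its normal range.
    others_normal = all(usage[r] < normal for r in RESOURCES if r != "cpu")
    if usage["cpu"] >= high and others_normal:
        findings.append("application code (CPU-bound)")

    # Steady growth: each memory sample is higher than the previous one (3+ samples).
    if len(memory_history) >= 3 and all(
        later > earlier for earlier, later in zip(memory_history, memory_history[1:])
    ):
        findings.append("possible memory leak")

    if usage["db_connections"] >= high and usage["cpu"] < normal:
        findings.append("connection pool exhaustion")

    if usage["disk_io"] >= high and usage["cpu"] < normal:
        findings.append("disk I/O")

    if usage["network_io"] >= high:
        findings.append("network saturation")

    # Nothing on this server is busy, yet responses are slow.
    if response_slow and all(usage[r] < normal for r in RESOURCES):
        findings.append("external dependency")

    return findings
```

The function returns a list because more than one pattern can match at once, for example a memory leak alongside a saturated network.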
| Tool | What It Monitors | Cost |
|---|---|---|
| Grafana + Prometheus | Server metrics, custom metrics, dashboards | Free (open source) |
| New Relic | APM, server metrics, distributed tracing | Paid (free tier available) |
| Datadog | Infrastructure, APM, logs, synthetics | Paid (free tier available) |
| htop / top | CPU, memory, processes (command line) | Free (top ships with most Linux distributions; htop is usually a separate install) |
| JMeter PerfMon | Server metrics plugin for JMeter | Free (JMeter plugin) |
Always set up server monitoring BEFORE running performance tests. Knowing response times without knowing server resource usage is like knowing a patient has a fever without checking their blood work. Start with simple tools like htop on the server or the JMeter PerfMon plugin, then graduate to Grafana + Prometheus for production-grade monitoring.
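If no monitoring tool is installed yet, a short script can still record CPU usage alongside a test run. The sketch below assumes a Linux server, where `/proc/stat` exposes cumulative CPU counters; it computes utilization from two readings of the aggregate `cpu` line. The function names are illustrative, and this is a stopgap rather than a replacement for the tools in the table.

```python
import time
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def cpu_percent(before: str, after: str) -> float:
    """CPU utilization (percent) between two readings of the 'cpu' line."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    # Counter order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already included in user time, so only the first 8 are summed.
    total_delta = sum(b[:8]) - sum(a[:8])
    idle_delta = (b[3] + b[4]) - (a[3] + a[4])  # idle + iowait
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_cpu(interval_s: float, count: int) -> List[float]:
    """Collect `count` utilization readings, one every `interval_s` seconds."""
    readings: List[float] = []
    previous = read_cpu_line()
    for _ in range(count):
        time.sleep(interval_s)
        current = read_cpu_line()
        readings.append(cpu_percent(previous, current))
        previous = current
    return readings
```

Starting something like `sample_cpu(5, 120)` on the server just before the load test begins would give ten minutes of CPU readings to line up against the response-time graph.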
Q: What server-side metrics do you monitor during performance tests?
A: I monitor five key metrics: CPU usage (should stay under 80%, ideally below 60%), memory usage (watch for gradual growth, which indicates a leak), disk I/O (high I/O suggests query or logging issues), network I/O (a saturated network means payloads need optimizing), and database connection pool usage (near the maximum suggests connection leaks or an undersized pool). The key is correlating server metrics with test metrics. If response time spikes at 300 users and CPU hits 95% at the same point, CPU is the bottleneck. If all server metrics are low but response time is high, the bottleneck is likely an external dependency. I use Grafana + Prometheus or the JMeter PerfMon plugin for monitoring.
Key Point: Server metrics (CPU, memory, disk, network, DB connections) tell you WHY performance degrades. Correlate them with response time to pinpoint the bottleneck. Always set up monitoring before running tests.