Chapter 8: Analyzing Results and Bottleneck Identification
Your performance test report tells you WHAT is slow. But it does not tell you WHY. To find the root cause, you need to look at the server -- specifically its CPU, memory, disk, and network. Think of it like this: the load test report is the patient's symptoms ("I have a headache and my vision is blurry"), and server monitoring is the blood test that reveals the actual disease ("your blood sugar is dangerously high"). You cannot prescribe treatment based on symptoms alone.
A CPU bottleneck means the processor cannot keep up with the work being asked of it. The classic symptom is that, once the CPU is saturated, response time grows roughly linearly with user count -- double the users, roughly double the response time. The server is simply doing too much computation per request.
| CPU Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Overall CPU Usage | < 70% | 70-85% | > 85% sustained |
| User CPU (application work) | < 60% | 60-80% | > 80% |
| System CPU (OS kernel work) | < 15% | 15-30% | > 30% (context switching) |
| IO Wait | < 10% | 10-25% | > 25% (disk is the bottleneck, not CPU) |
| Load Average (Linux) | < number of cores | 1-2x cores | > 2x cores |
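The load-average row of the table can be checked mechanically. The following is a minimal sketch, assuming the thresholds from the table above; the helper name `classify_load` is hypothetical and not part of any standard tool.

```shell
# classify_load LOAD CORES  (hypothetical helper name)
# Prints healthy, warning, or critical using the load-average row of the
# table: below 1x cores, 1-2x cores, above 2x cores.
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load < cores)           print "healthy"
        else if (load <= 2 * cores) print "warning"
        else                        print "critical"
    }'
}

# Usage on a live Linux host: 1-minute load average vs. core count
# classify_load "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

Comparing load against the core count matters because a load average of 6 is comfortable on a 16-core server and a problem on a 2-core one.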
# Real-time CPU monitoring (run on the server during the test)
# Quick overview -- a single batch-mode snapshot (run plain `top` for a live view)
top -b -n 1 | head -20
# Detailed CPU breakdown (user, system, iowait, idle)
mpstat -P ALL 5
# Watch for:
# %usr > 80 → Application is CPU-bound
# %sys > 30 → Too many context switches (too many threads)
# %iowait > 25 → Disk is slow, CPU is waiting
# Per-process CPU usage (find the hungry process)
ps aux --sort=-%cpu | head -10
# Thread-level CPU usage for a Java app (PID = 12345)
top -H -p 12345
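# This shows which THREAD is consuming CPU (the PID column holds the thread ID)
# Convert the thread ID (LWP) to hex to match the nid= field in a jstack thread dump
printf '0x%x\n' 12399   # example thread ID 12399 -> 0x306f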
# Save CPU data to a file for post-test analysis
sar -u 5 > cpu_during_test.log &
# This logs CPU usage every 5 seconds in the background
Memory bottlenecks come in two flavors: the system runs out of memory (OOM), or the garbage collector works so hard to reclaim memory that it stalls the application (GC thrashing). The sneakiest variant is a memory leak -- the application works fine for hours, then suddenly crashes. In soak tests, this is the number one thing you are looking for.
A memory leak in a performance test looks like a slowly rising line on the memory usage chart. Under constant load, memory should stabilize -- the app allocates objects, GC cleans them up, memory stays flat. If memory keeps climbing without stabilizing, objects are being allocated but never released. The test might run fine for 30 minutes, then the GC pauses get longer and longer as it desperately tries to free memory, response times spike, and eventually the JVM throws an OutOfMemoryError.
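The "slowly rising line" can be turned into a number by fitting a straight line through memory samples taken under constant load. The sketch below assumes a log of `timestamp_seconds used_mb` pairs, one per line; the function name, the log format, and the `APP_PID` variable in the comment are all illustrative assumptions.

```shell
# mem_slope_mb_per_hour  (hypothetical helper name)
# Reads "timestamp_seconds used_mb" pairs on stdin and prints the
# least-squares slope of memory usage in MB per hour.
mem_slope_mb_per_hour() {
    awk '
        NF >= 2 {
            if (n == 0) t0 = $1          # measure time relative to the first sample
            x = $1 - t0; y = $2
            n++; sx += x; sy += y; sxx += x * x; sxy += x * y
        }
        END {
            denom = n * sxx - sx * sx
            if (n < 2 || denom == 0) { print "not enough data"; exit 1 }
            printf "%.1f\n", (n * sxy - sx * sy) / denom * 3600
        }'
}

# Collecting samples once a minute for a hypothetical APP_PID (ps reports RSS in KB):
# while sleep 60; do
#     echo "$(date +%s) $(( $(ps -o rss= -p "$APP_PID") / 1024 ))"
# done > mem.log
# mem_slope_mb_per_hour < mem.log
```

A slope near zero after the warm-up period suggests memory has stabilized; a slope that stays clearly positive for hours is the pattern worth a heap dump. What counts as "clearly positive" depends on the application, so treat any cutoff as a judgment call.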
# Overall system memory
free -m -s 5
# Watch "available" column -- this is what the OS can allocate
# If "available" approaches 0, you are in trouble
# Per-process memory usage
ps aux --sort=-%mem | head -10
# Java-specific: Monitor GC activity
# Add this JVM flag to the application (JDK 9+ unified logging):
# -Xlog:gc*:file=gc.log:time,uptime:filecount=5,filesize=20m
# On JDK 8, use instead: -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
# Watch GC logs for:
# - Increasing GC frequency (GC running every second instead of every 30s)
# - Increasing GC pause times (from 50ms to 500ms to 2000ms)
# - Full GC events (stop-the-world pauses that freeze the entire app)
# Java: Get heap dump for memory leak analysis
# (This pauses the app -- do it on staging, not production)
jmap -dump:format=b,file=heap_dump.hprof <PID>
# Then analyze with Eclipse MAT or VisualVM
# Linux: Monitor for OOM killer activity
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
| Memory Symptom | Likely Cause | Solution |
|---|---|---|
| Memory grows during load, stabilizes when load stops | Normal -- objects created during request processing | No action needed if it stabilizes within acceptable limits |
| Memory grows during load and NEVER drops back | Memory leak -- objects are retained after requests complete | Heap dump analysis. Check for static collections, unclosed resources, event listener leaks |
| GC pauses getting longer over time | Heap is filling up, GC working harder to find reclaimable objects | Fix the leak, or increase heap size as a short-term workaround |
| Sudden OOM crash after hours of stable running | Slow memory leak. Soak tests are designed to catch exactly this. | Heap dump analysis before the crash. Enable -XX:+HeapDumpOnOutOfMemoryError |
| High %sys or %iowait CPU combined with high memory usage | OS is likely swapping -- physical RAM is full, using disk as memory | Confirm with the si/so columns in vmstat. Add RAM, reduce JVM heap size, or fix the memory leak |
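Swapping, the last row of the table, can be confirmed from the `si` (swap-in) and `so` (swap-out) columns of `vmstat`. The sketch below counts how many samples show swap activity; the function name is hypothetical, and it assumes the standard Linux `vmstat` layout with a column-name header line starting with `r`.

```shell
# count_swap_samples  (hypothetical helper name)
# Reads `vmstat <interval> <count>` output on stdin and prints how many
# samples show swap-in (si) or swap-out (so) activity.
count_swap_samples() {
    awk '
        $1 == "r" {                      # column-name header line
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && so && $1 ~ /^[0-9]+$/ && ($si > 0 || $so > 0) { hits++ }
        END { print hits + 0 }'
}

# Usage during the test: one minute of 5-second samples
# vmstat 5 12 | count_swap_samples
```

Note that the first data row `vmstat` prints is an average since boot, so a single hit on that row alone is weak evidence; sustained nonzero `si`/`so` during the test is what points to swapping.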
If you are testing a Java application, ALWAYS enable GC logging before the test. Without GC logs, you are flying blind. A 2-second GC pause looks exactly like a slow database query in your JMeter report -- you cannot tell the difference from the client side. GC logs let you pinpoint: "The response time spike at 14:32:05 coincides with a Full GC that paused the JVM for 1.8 seconds."
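To line up a response time spike with a GC pause, it helps to pull the long pauses out of the GC log. The sketch below assumes JDK 9+ unified logging with `time,uptime` decorators, where each pause is summarized on a line containing `Pause` and ending in a duration such as `1812.402ms`. The function name is hypothetical, and the sample log lines in the comments are illustrative, not output from a real run.

```shell
# long_gc_pauses THRESHOLD_MS  (hypothetical helper name)
# Reads a GC log on stdin and prints every "Pause" line whose duration
# (the last field, e.g. "1812.402ms") exceeds the threshold.
long_gc_pauses() {
    awk -v limit="$1" '
        / Pause / && $NF ~ /^[0-9.]+ms$/ {
            ms = $NF
            sub(/ms$/, "", ms)           # strip the unit, keep the number
            if (ms + 0 > limit + 0) print
        }'
}

# Usage: list all pauses longer than one second
# long_gc_pauses 1000 < gc.log
#
# Illustrative input lines (assumed format):
# [2024-05-01T14:31:40.002+0000][5000.199s] GC(410) Pause Young (Normal) (G1 Evacuation Pause) 1510M->1402M(2048M) 48.117ms
# [2024-05-01T14:32:05.120+0000][5025.317s] GC(412) Pause Full (Allocation Failure) 1986M->1874M(2048M) 1812.402ms
```

The wall-clock timestamp in the first bracket is what you match against the timestamps in the JMeter or Gatling results.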
Q: How would you identify a memory leak during a soak test?
A: I would monitor three things during the soak test: application memory usage over time, GC behavior, and response time trends. A memory leak manifests as a steadily increasing memory baseline -- after each garbage collection cycle, the memory drops back to a slightly higher level than before. Over hours, this accumulates. I would look for response time degradation that correlates with GC activity -- as the heap fills up, GC runs more frequently and takes longer, causing periodic latency spikes. If I confirm a leak, I would capture a heap dump before the application crashes (using jmap or enabling HeapDumpOnOutOfMemoryError). I would then analyze it with Eclipse MAT to find the retained objects -- usually it is something like an unbounded cache, unclosed database connections, or event listeners that are never unregistered. The key metric is: under constant load, if memory usage after GC keeps increasing over time, that is a confirmed leak.
Key Point: CPU bottlenecks cause roughly linear response time growth with user count. Memory leaks cause gradual degradation over time under constant load. Always monitor both during performance tests -- your JMeter/Gatling report cannot distinguish between the two. For Java applications, GC logging is essential.