Chapter 8: Analyzing Results and Bottleneck Identification
Now that you know how to read the reports, let us go deeper into the two metrics that matter most: response time and throughput. Most beginners look at these in isolation -- "response time was 500ms, throughput was 100 req/sec." That tells you almost nothing. The real insight comes from looking at how these metrics change in relation to each other and to the number of concurrent users. Imagine you are driving a car. Speed (throughput) and fuel efficiency (response time) are related -- you can go faster up to a point, but push too hard and the engine overheats, speed drops, and fuel efficiency tanks. The same thing happens to servers under load.
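Every system goes through three phases as you increase load. In Phase 1, throughput rises with the user count while response times stay flat. In Phase 2, throughput plateaus and response times climb as requests start to queue. In Phase 3, throughput falls and errors and timeouts pile up. Understanding these phases is the foundation of all performance analysis.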
Your job as a performance tester is to find where Phase 1 ends and Phase 2 begins. That transition point is your system's capacity. Everything after that is damage control. The transition from Phase 2 to Phase 3 is where you lose customers.
When you look at a response time histogram (available in both JMeter and Gatling reports), the shape tells a story. A healthy system produces a tight, left-leaning distribution -- most responses cluster near the same fast time. An unhealthy system shows up as a wide spread, a bimodal distribution with two humps, a long right tail, or a distribution shifted entirely to the right.
| Distribution Shape | What It Looks Like | What It Means | Common Cause |
|---|---|---|---|
| Tight left cluster | All responses near 50-100ms | Healthy, consistent performance | Well-optimized system under comfortable load |
| Wide spread | Responses from 50ms to 5000ms, no clear peak | Highly inconsistent performance | Thread contention, mixed fast/slow queries, shared resources |
| Bimodal (two humps) | Cluster at 80ms AND cluster at 2000ms | Two distinct response paths | Cache hit vs. cache miss, or fast table vs. slow join |
| Long right tail | Most at 100ms, but tail stretches to 30s | Occasional extreme outliers | GC pauses, lock waits, timeout retries, connection pool exhaustion |
| Shifted right | Entire distribution centered at 2000ms+ | Everything is slow | Fundamental capacity problem, downstream dependency slow |
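When no histogram is at hand, a few percentiles can hint at the same shapes. The following is a minimal sketch, not part of JMeter or Gatling: the function names and the 10x tail threshold are assumptions chosen for illustration, and the input is assumed to be a plain list of response times in milliseconds.

```python
import statistics

def summarize_latencies(samples_ms):
    """Return the median, p95, p99 and the p99/median ratio of response times (ms)."""
    # quantiles(n=100) returns the 99 cut points p1..p99, so index 49 is the median
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    return {"p50": p50, "p95": p95, "p99": p99, "tail_ratio": p99 / p50}

def looks_long_tailed(samples_ms, ratio_threshold=10.0):
    """Flag a long right tail when p99 is far above the median.

    The default threshold of 10x is an illustrative assumption, not a standard.
    """
    return summarize_latencies(samples_ms)["tail_ratio"] > ratio_threshold
```

A tight cluster gives a tail ratio close to 1, while a run where a few requests stall for tens of seconds gives a very large ratio. A bimodal distribution is harder to catch this way and is still best confirmed by looking at the histogram itself.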
Raw throughput (requests per second) is useful, but you need context. A throughput of 500 req/sec with 50 users means each user gets 10 requests served per second -- great. The same 500 req/sec with 5,000 users means each user gets 0.1 requests per second -- they are waiting 10 seconds per request. Always relate throughput to concurrent users.
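The per-user arithmetic above can be written down as a small sketch. The function names are hypothetical, and the calculation assumes every user is continuously issuing requests with no think time between them.

```python
def per_user_rate(throughput_rps, concurrent_users):
    """Requests served per second for each user, on average."""
    return throughput_rps / concurrent_users

def seconds_per_request(throughput_rps, concurrent_users):
    """Average time each user waits for one request to be served."""
    return concurrent_users / throughput_rps

# The two scenarios from the text: the same 500 req/sec, very different experience
light = per_user_rate(500, 50)           # 50 users
heavy = per_user_rate(500, 5000)         # 5,000 users
heavy_wait = seconds_per_request(500, 5000)
```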
# Little's Law: Concurrency = Throughput x Response Time
# Or rearranged: Throughput = Concurrency / Response Time
# Example 1: Healthy system
Concurrent users: 100
Avg response time: 0.2 seconds
Expected throughput: 100 / 0.2 = 500 req/sec
Actual throughput: 480 req/sec ← Close to expected. System is healthy.
# Example 2: Saturated system
Concurrent users: 500
Avg response time: 2.0 seconds
Expected throughput: 500 / 2.0 = 250 req/sec
Actual throughput: 180 req/sec ← Below expected. Queuing is happening.
# Example 3: Collapsed system
Concurrent users: 1000
Avg response time: 15.0 seconds
Expected throughput: 1000 / 15.0 = 67 req/sec
Actual throughput: 40 req/sec ← Way below. Threads dying, connections timing out.
# Key insight: If actual throughput is much lower than
# Little's Law predicts, requests are being dropped or queued.
Open both charts side by side (or overlay them mentally). Look for the moment when throughput stops rising. Now look at the response time chart at that same timestamp. Did response times start climbing? If yes, you have found your saturation point. Now check the Active Users chart -- how many users were active at that moment? That is your system capacity number.
1. Open the Throughput Over Time chart. Find the point where the line flattens (stops increasing).
2. Note the exact timestamp where throughput plateaued.
3. Open the Response Times Over Time chart. Check what happened at the same timestamp.
4. Open the Active Users Over Time chart. Note the user count at that timestamp.
5. Check the Errors Over Time chart. Did errors start appearing near the same timestamp?
6. Document: "System saturated at [X] concurrent users, with throughput plateau of [Y] req/sec and response times starting to exceed [Z] ms."
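If you have exported the numbers behind those charts, the same procedure can be roughed out in code. This is a sketch under stated assumptions: the `LoadStep` record, the function names, and the 5% minimum-gain threshold are all hypothetical, and the input is assumed to be one summary row per load level, ordered by increasing user count, with non-zero throughput.

```python
from typing import NamedTuple, Optional, Sequence

class LoadStep(NamedTuple):
    users: int
    throughput_rps: float
    avg_response_ms: float

def find_saturation(steps: Sequence[LoadStep], min_gain: float = 0.05) -> Optional[LoadStep]:
    """Return the step where throughput flattens while response time starts rising.

    min_gain is the smallest relative throughput increase still counted as growth;
    5% is an illustrative default, not an established rule.
    """
    for prev, cur in zip(steps, steps[1:]):
        gain = (cur.throughput_rps - prev.throughput_rps) / prev.throughput_rps
        if gain < min_gain and cur.avg_response_ms > prev.avg_response_ms:
            return prev  # the plateau begins at the earlier of the two steps
    return None  # no plateau seen: the test never reached saturation

def saturation_note(step: LoadStep) -> str:
    """Fill in the documentation template from the last step of the procedure."""
    return (
        f"System saturated at {step.users} concurrent users, with throughput plateau of "
        f"{step.throughput_rps:g} req/sec and response times starting to exceed "
        f"{step.avg_response_ms:g} ms."
    )

# Made-up numbers for illustration only
steps = [
    LoadStep(50, 250, 200),
    LoadStep(100, 480, 210),
    LoadStep(200, 900, 220),
    LoadStep(400, 920, 450),
    LoadStep(800, 700, 2000),
]
saturated = find_saturation(steps)
```

Treat the result as a pointer to where to look in the charts, not as a replacement for looking: a noisy run can produce a flat step that is not a real plateau.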
Q: Explain Little's Law and how you use it in performance testing.
A: Little's Law states that the average number of items in a system equals the arrival rate multiplied by the average time each item spends in the system. In performance testing terms: Concurrent Users = Throughput times Average Response Time. I use it as a sanity check after every test. If I have 200 concurrent users and the average response time is 500ms, I should see roughly 400 requests per second. If actual throughput is significantly lower, it means requests are being queued, dropped, or timing out. The gap between expected and actual throughput quantifies how much the system is struggling. It is also useful for capacity planning -- if we need to support 1,000 concurrent users with a 1-second response time SLA, we need infrastructure that can sustain 1,000 requests per second.
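The sanity check described in the answer can be sketched in a few lines. The function names are hypothetical, and the formula assumes users send their next request immediately; if your script includes think time, add it to the response time before dividing.

```python
def expected_throughput(concurrent_users, avg_response_s):
    """Little's Law rearranged: Throughput = Concurrency / Response Time."""
    return concurrent_users / avg_response_s

def throughput_gap(concurrent_users, avg_response_s, actual_rps):
    """Fraction by which measured throughput falls short of the Little's Law estimate."""
    expected = expected_throughput(concurrent_users, avg_response_s)
    return (expected - actual_rps) / expected

# Figures taken from the worked examples in this chapter
healthy_gap = throughput_gap(100, 0.2, 480)     # about 4% below expected
saturated_gap = throughput_gap(500, 2.0, 180)   # about 28% below expected
collapsed_gap = throughput_gap(1000, 15.0, 40)  # about 40% below expected
```

How large a gap counts as a problem is a judgment call for your system; the examples here only show that the gap widens as the system moves from healthy to saturated to collapsed.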
Key Point: Response time and throughput are two sides of the same coin. Always analyze them together against the concurrent user count to find the saturation point -- the moment when the system cannot take on more work without degrading.
Key Point: Use Little's Law (Concurrency = Throughput x Response Time) to sanity-check results and find the saturation point where response times start rising while throughput plateaus.