Here is a scenario that will change how you think about averages forever. Your API handles 100 requests. 99 of them complete in 100ms. One request takes 10,000ms (10 seconds). The average response time is 199ms. Looks great on a dashboard. But one user just stared at a loading spinner for 10 seconds. The average lied to you.
Averages are heavily influenced by outliers, and they hide the distribution. If half your requests take 50ms and half take 5 seconds, the average is about 2.5 seconds. That single number describes no request that actually happened. Half your users had a great experience and half wanted to throw their laptop. The average makes it sound mediocre but acceptable.
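The arithmetic behind both scenarios can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the data is synthetic, built to match the two scenarios described above.

```python
from statistics import mean, median

# Scenario 1: 99 requests at 100 ms, one request at 10,000 ms.
one_outlier_ms = [100] * 99 + [10_000]
avg_outlier = mean(one_outlier_ms)       # 199 ms: looks healthy
typical_outlier = median(one_outlier_ms) # 100 ms: what most requests saw
worst_outlier = max(one_outlier_ms)      # 10,000 ms: the request the average hid

# Scenario 2: half the requests at 50 ms, half at 5,000 ms.
bimodal_ms = [50] * 50 + [5_000] * 50
avg_bimodal = mean(bimodal_ms)           # 2,525 ms: a value no request actually took

print(f"one outlier: mean={avg_outlier} ms, median={typical_outlier} ms, max={worst_outlier} ms")
print(f"bimodal:     mean={avg_bimodal} ms")
```

In the second scenario no request took anywhere near the mean, which is why the number is hard to act on.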
A percentile answers this question: "X% of requests completed within this time." p95 = 2 seconds means 95 out of every 100 requests completed in 2 seconds or less. The other 5 took longer.
| Percentile | Meaning | Out of 1000 Requests | When to Use |
|---|---|---|---|
| p50 (median) | 50% complete within this time | 500 are faster, 500 are slower | Typical user experience |
| p75 | 75% complete within this time | 750 faster, 250 slower | What most users experience |
| p90 | 90% complete within this time | 900 faster, 100 slower | Good baseline metric |
| p95 | 95% complete within this time | 950 faster, 50 slower | Industry standard for SLAs |
| p99 | 99% complete within this time | 990 faster, 10 slower | Near-worst case |
| p99.9 | 99.9% complete within this time | 999 faster, 1 slower | Critical systems (banking, healthcare) |
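
The definition above can be sketched as a small function. This version assumes the nearest-rank method (the smallest sample with at least p% of samples at or below it). Load-testing tools and libraries may use interpolating definitions instead and can report slightly different values. The function name `percentile` and the sample data are illustrative.

```python
import math
from fractions import Fraction

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids float rounding for values like 99.9.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical sample: 1000 requests taking 1 ms, 2 ms, ..., 1000 ms.
latencies_ms = list(range(1, 1001))
for p in (50, 90, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

One consequence of this definition: on a sample of only 100 requests where 99 take 100ms and one takes 10,000ms, nearest-rank p99 is still 100ms, and only the maximum exposes the slow request. On small samples it is worth reporting Max alongside the percentiles.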
Here is a real test result from a login API under load. Let us read it together.
| Metric | Value |
|---|---|
| Min | 45ms |
| Average | 320ms |
| p50 (median) | 180ms |
| p75 | 350ms |
| p90 | 650ms |
| p95 | 1200ms |
| p99 | 4500ms |
| Max | 12000ms |
| Throughput | 850 req/s |
| Error Rate | 0.3% |

What does this tell us? The average of 320ms looks fine, although it already sits well above the 180ms median, which hints at a long tail. The p95 is 1200ms, which means 5% of users (25 out of every 500) wait over 1.2 seconds just to log in. And the p99 is 4.5 seconds, so 1% of users (5 out of every 500) wait over 4.5 seconds. If your SLA says p95 < 1 second, this test FAILS, even though the average looks good.
A large gap between p95 and p99 is a red flag. It means a small percentage of requests are experiencing something very different from the majority. Common causes: garbage collection pauses, connection pool waits, cache misses (most requests hit cache, a few hit the database), and cold starts in serverless environments.
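Both checks, the SLA comparison and the tail gap, can be automated once the percentiles are available. The sketch below uses the values from the login API result above. The p99/p95 ratio and its warning threshold of 3.0 are assumptions chosen for illustration, not an established rule, and the function names are hypothetical.

```python
# Percentiles copied from the login API result above, in milliseconds.
results_ms = {"p50": 180, "p95": 1200, "p99": 4500}

SLA_P95_LIMIT_MS = 1000  # the example SLA: p95 < 1 second
TAIL_RATIO_WARN = 3.0    # hypothetical threshold; tune it for your system

def meets_sla(results, limit_ms):
    """True when p95 is strictly below the SLA limit."""
    return results["p95"] < limit_ms

def tail_ratio(results):
    """How many times slower p99 is than p95."""
    return results["p99"] / results["p95"]

sla_ok = meets_sla(results_ms, SLA_P95_LIMIT_MS)
ratio = tail_ratio(results_ms)

print(f"SLA (p95 < {SLA_P95_LIMIT_MS} ms): {'PASS' if sla_ok else 'FAIL'}")
print(f"p99/p95 ratio: {ratio:.2f}")
if ratio > TAIL_RATIO_WARN:
    print("Large tail gap: check GC pauses, pool waits, cache misses, cold starts")
```

A ratio is used rather than an absolute difference so the same check works for endpoints with very different baseline latencies.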
When reporting results, always include at least p50, p95, and p99. If stakeholders only want one number, give them p95. It represents what the vast majority of users experience while still accounting for slower requests. If they push back with "the average is fine," show them the p99 and explain that those are real users having a terrible experience.
You cannot average percentiles across multiple endpoints. If API A has p95 = 100ms and API B has p95 = 200ms, the combined p95 is NOT 150ms. It could be anything from 100ms to 200ms, depending on how much traffic each endpoint receives and how its latencies are distributed. If you need a combined percentile, you must merge the raw data and recalculate.
Many reporting tools show averaged percentiles by default. This is mathematically wrong and can mask serious problems. Always verify that your tool calculates percentiles from raw data, not by averaging pre-calculated percentiles from different time windows or endpoints.
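A small experiment makes the difference concrete. The data below is synthetic, constructed so that API A has p95 = 100ms and API B has p95 = 200ms as in the example, with A receiving far more traffic. The helper `p95` assumes the nearest-rank definition, so other percentile methods may give slightly different numbers.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Synthetic latencies in milliseconds (illustrative, not measured data).
api_a_ms = [60] * 1800 + [100] * 50 + [120] * 50  # 1900 requests, p95 = 100
api_b_ms = [150] * 90 + [200] * 10                # 100 requests,  p95 = 200

averaged = (p95(api_a_ms) + p95(api_b_ms)) / 2    # the wrong way
merged = p95(api_a_ms + api_b_ms)                 # the right way: raw data

print(f"p95 A: {p95(api_a_ms)} ms, p95 B: {p95(api_b_ms)} ms")
print(f"average of the two p95 values: {averaged} ms")
print(f"p95 of the merged raw data:    {merged} ms")
```

Here the merged p95 is 120ms, not 150ms, because API A contributes 95% of the requests and pulls the combined percentile toward its own distribution.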
Q: Why do we use percentiles instead of average response time in performance testing?
A: Averages are misleading because they hide the distribution. If 99 requests take 100ms and 1 takes 10 seconds, the average is 199ms -- looks fine, but one user had a terrible experience. Percentiles show the real picture: p50 and p95 are both 100ms, so the typical request was fast, while the tail measurements (the 10,000ms maximum in this 100-request sample, p99 or p99.9 at realistic volumes) expose the slow outliers. I always report p50 (typical experience), p95 (industry standard for SLAs), and p99 (near-worst case). The gap between p95 and p99 is diagnostic -- a large gap indicates outlier issues like GC pauses or cache misses. You also cannot average percentiles -- they must be recalculated from raw data.
Key Point: Averages lie. Percentiles tell the truth. p95 is the industry standard for SLAs. A large gap between p95 and p99 is a red flag. Never average percentiles -- recalculate from raw data.