Percentiles: The Honest Metric

Here is a scenario that will change how you think about averages forever. Your API handles 100 requests. 99 of them complete in 100ms. One request takes 10,000ms (10 seconds). The average response time is 199ms. Looks great on a dashboard. But one user just stared at a loading spinner for 10 seconds. The average lied to you.

Why Averages Are Dangerous

Averages are heavily influenced by outliers, and they hide the distribution. If half your requests take 50ms and half take 5 seconds, the average is about 2.5 seconds. That single number describes almost no real request: half your users had a great experience and half wanted to throw their laptops. The average makes it sound mediocre but acceptable.
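To see both scenarios in numbers, here is a minimal sketch in Python. The 1-second "slow" cutoff and the sample size of 100 for the half-and-half case are assumptions made for illustration; the text only gives the proportions.

```python
from statistics import mean

# Scenario 1: 99 fast requests and one 10-second outlier (times in ms).
one_outlier = [100] * 99 + [10_000]

# Scenario 2: half fast, half slow. The sample size of 100 is assumed.
split = [50] * 50 + [5_000] * 50

SLOW_MS = 1_000  # hypothetical cutoff for "this felt slow"

def summarize(times_ms):
    """Return the mean and the number of requests slower than SLOW_MS."""
    slow = sum(1 for t in times_ms if t > SLOW_MS)
    return mean(times_ms), slow

outlier_mean, outlier_slow = summarize(one_outlier)
split_mean, split_slow = summarize(split)

print(f"one outlier: mean={outlier_mean}ms, slow requests={outlier_slow}")
print(f"50/50 split: mean={split_mean}ms, slow requests={split_slow}")
```

The means come out at 199ms and 2525ms. Neither number tells you that 1 request in the first case and 50 in the second were painfully slow.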

Average Says

✗ Average: 199ms -- looks great
✗ Average: 2500ms -- looks mediocre
✗ Average: 500ms -- acceptable
✗ Hides the fact that some users wait 10+ seconds
✗ A single outlier skews the number
✗ Useless for setting pass/fail criteria

Percentiles Say

✓ p50: 100ms, p95: 100ms, p99: 10000ms -- 1% have a terrible experience
✓ p50: 50ms, p95: 5000ms -- half are great, half are terrible
✓ p50: 200ms, p95: 800ms, p99: 4000ms -- most are OK but the tail is bad
✓ Shows the distribution clearly
✓ Outliers only affect high percentiles, where they belong
✓ Clear thresholds: p95 < 2s means 95% of requests finish in under 2 seconds

Understanding Percentile Values

A percentile answers this question: "X% of requests completed within this time." p95 = 2 seconds means 95 out of every 100 requests completed in 2 seconds or less. The other 5 took longer.
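One simple way to compute this is the nearest-rank method: sort the response times and pick the smallest value that covers at least X% of them. The sketch below assumes that definition. Many libraries and load-testing tools interpolate between neighboring samples instead, so their results can differ slightly, especially on small samples. The sample data is hypothetical.

```python
import math

def percentile(times_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(times_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical sample of 20 response times in ms (not from a real test).
sample = [100] * 18 + [900, 4_000]

p50 = percentile(sample, 50)
p95 = percentile(sample, 95)
p99 = percentile(sample, 99)
print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
```

Here p50 is 100ms, while p95 and p99 land on the two slow requests at 900ms and 4000ms. The slow tail only shows up in the high percentiles.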

Percentile	Meaning	Out of 1000 Requests	When to Use
p50 (median)	50% complete within this time	500 faster, 500 slower	Typical user experience
p75	75% complete within this time	750 faster, 250 slower	What most users experience
p90	90% complete within this time	900 faster, 100 slower	Good baseline metric
p95	95% complete within this time	950 faster, 50 slower	Industry standard for SLAs
p99	99% complete within this time	990 faster, 10 slower	Near-worst case
p99.9	99.9% complete within this time	999 faster, 1 slower	Critical systems (banking, healthcare)

Reading Percentiles in Practice

Here is a real test result from a login API under load. Let us read it together.

Metric         Value
─────────────  ──────
Min            45ms
Average        320ms
p50 (median)   180ms
p75            350ms
p90            650ms
p95            1200ms
p99            4500ms
Max            12000ms
Throughput     850 req/s
Error Rate     0.3%

What does this tell us? The average of 320ms looks fine. But the p95 is 1200ms -- that means 5% of requests (50 out of every 1,000) take longer than 1.2 seconds just to log in. And the p99 is 4.5 seconds -- 1% of requests (10 out of every 1,000) take longer than 4.5 seconds. If your SLA says p95 < 1 second, this test FAILS, even though the average looks good.
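A pass/fail check like this is easy to automate. The following is a minimal sketch, not a feature of any particular tool: the measured values come from the table above, the p95 limit of 1 second comes from the SLA example, and the dictionary layout and the `check_sla` helper are hypothetical.

```python
# Measured percentiles from the login test, in ms.
measured_ms = {"p50": 180, "p95": 1200, "p99": 4500}

# SLA thresholds in ms. Only the p95 limit comes from the text.
sla_ms = {"p95": 1000}

def check_sla(measured, sla):
    """Return {metric: (value, limit, passed)} for every SLA metric."""
    return {
        name: (measured[name], limit, measured[name] < limit)
        for name, limit in sla.items()
    }

results = check_sla(measured_ms, sla_ms)
passed = all(ok for _, _, ok in results.values())

for name, (value, limit, ok) in results.items():
    print(f"{name}: {value}ms vs limit {limit}ms -> {'PASS' if ok else 'FAIL'}")
print("overall:", "PASS" if passed else "FAIL")
```

The check fails on p95 even though the 320ms average would have passed the same 1-second limit comfortably.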

The Gap Between p95 and p99

A large gap between p95 and p99 is a red flag. It means a small percentage of requests are experiencing something very different from the majority. Common causes: garbage collection pauses, connection pool waits, cache misses (most requests hit cache, a few hit the database), and cold starts in serverless environments.
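One rough way to quantify the gap is the ratio of p99 to p95. The sketch below applies it to the login result. The warning threshold of 3.0 is an arbitrary value chosen for illustration, not an industry rule, so tune it to your own system.

```python
# p95 and p99 from the login test, in ms.
p95_ms = 1200
p99_ms = 4500

# Hypothetical threshold: flag the tail when p99 is more than 3x p95.
TAIL_RATIO_WARN = 3.0

tail_ratio = p99_ms / p95_ms
tail_is_suspicious = tail_ratio > TAIL_RATIO_WARN

print(f"p99/p95 ratio: {tail_ratio:.2f}")
if tail_is_suspicious:
    print("Large p95-p99 gap: check GC pauses, pool waits, cache misses, cold starts")
```

For the login test the ratio is 3.75: the slowest 1% of requests take nearly four times as long as the p95.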

When reporting results, always include at least p50, p95, and p99. If stakeholders only want one number, give them p95. It represents what the vast majority of users experience while still accounting for slower requests. If they push back with "the average is fine," show them the p99 and explain that those are real users having a terrible experience.

Percentiles Cannot Be Averaged

You cannot average percentiles across multiple endpoints. If API A has p95 = 100ms and API B has p95 = 200ms, the combined p95 is NOT necessarily 150ms. It could be anything from 100ms to 200ms, depending on the two distributions and on how much traffic each endpoint receives. If you need a combined percentile, you must merge the raw data and recalculate.

Some reporting tools show averaged percentiles by default. This is mathematically wrong and can mask serious problems. Always verify that your tool calculates percentiles from raw data, not by averaging pre-calculated percentiles from different time windows or endpoints.
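Here is a small sketch of the difference between averaging two p95 values and recalculating from merged raw data. The response times are hypothetical, chosen so that the per-endpoint p95 values match the API A and API B example. The `percentile` helper uses the nearest-rank method, which is one of several common definitions.

```python
import math

def percentile(times_ms, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(times_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical raw response times in ms.
api_a = [80] * 900 + [100] * 100   # busy endpoint, 1000 requests
api_b = [120] * 90 + [200] * 10    # quiet endpoint, 100 requests

p95_a = percentile(api_a, 95)
p95_b = percentile(api_b, 95)

averaged_p95 = (p95_a + p95_b) / 2           # the wrong way
merged_p95 = percentile(api_a + api_b, 95)   # recalculated from raw data

print(f"API A p95={p95_a}ms, API B p95={p95_b}ms")
print(f"averaged p95={averaged_p95}ms, merged p95={merged_p95}ms")
```

With this data the averaged value is 150ms but the true combined p95 is 120ms, because API A carries ten times the traffic of API B. A different traffic mix would move the combined value somewhere else between 100ms and 200ms.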

Q: Why do we use percentiles instead of average response time in performance testing?

A: Averages are misleading because they hide the distribution. If 99 requests take 100ms and 1 takes 10 seconds, the average is 199ms -- looks fine, but one user had a terrible experience. Percentiles show the real picture: p95 = 100ms means 95% of users had a great experience, and p99 = 10,000ms reveals that 1% had an awful one. I always report p50 (typical experience), p95 (industry standard for SLAs), and p99 (near-worst case). The gap between p95 and p99 is diagnostic -- a large gap indicates outlier issues like GC pauses or cache misses. You also cannot average percentiles -- they must be recalculated from raw data.

Key Point: Averages lie. Percentiles tell the truth. p95 is the industry standard for SLAs. A large gap between p95 and p99 is a red flag. Never average percentiles -- recalculate from raw data.

