Error Rate and Reliability

A system that responds in 50 ms but returns errors for 10% of requests is worse than one that responds in 500 ms with zero errors. Error rate is the percentage of requests that fail, which makes it the most direct measure of reliability. Under normal load, error rate should be near zero. As load increases, errors should still stay below your threshold. When errors spike, you have found either the breaking point or a bug.
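As a minimal sketch of the definition above, error rate is failed requests divided by total requests, expressed as a percentage. The function name and the sample counts are illustrative assumptions, not output from a particular tool.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Return the share of failed requests as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * failed_requests / total_requests


# Hypothetical run: 20,000 requests, 14 of which failed.
rate = error_rate(20_000, 14)
print(f"error rate: {rate:.2f}%")  # error rate: 0.07%
```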

Types of Errors in Performance Tests

Error Type	HTTP Code	What It Means	Common Cause
Server Error	500	Server crashed or threw an exception	Unhandled error under load, out of memory
Service Unavailable	503	Server is overloaded or down	All workers busy, auto-scaling not fast enough
Gateway Timeout	504	Upstream server did not respond in time	Backend service too slow under load
Too Many Requests	429	Rate limiting activated	Legitimate protection -- your test is hitting rate limits
Connection Timeout	N/A	Could not establish a connection	Connection pool exhausted, network issue
Socket Timeout	N/A	Connected but response never came	Server hung, deadlock, infinite loop
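One way to apply the table when post-processing results is to map each request outcome to an error type and count the types. The sketch below assumes each result is recorded as a pair of HTTP status (or None when no response arrived) and a timeout kind of "connect" or "read"; that record shape and the names `classify` and `STATUS_LABELS` are hypothetical.

```python
from collections import Counter
from typing import Optional

# Labels follow the table above; keys are HTTP status codes.
STATUS_LABELS = {
    500: "Server Error",
    503: "Service Unavailable",
    504: "Gateway Timeout",
    429: "Too Many Requests",
}


def classify(status: Optional[int], timeout_kind: Optional[str] = None) -> str:
    """Map one request outcome to an error type, or "OK" if it succeeded.

    status is None when no HTTP response was received; in that case
    timeout_kind is assumed to be "connect" or "read".
    """
    if status is None:
        if timeout_kind == "connect":
            return "Connection Timeout"
        if timeout_kind == "read":
            return "Socket Timeout"
        return "Other Network Error"
    if status in STATUS_LABELS:
        return STATUS_LABELS[status]
    if status >= 400:
        return f"Other HTTP {status}"
    return "OK"


# Hypothetical outcomes from a short run.
results = [(200, None), (500, None), (None, "connect"),
           (503, None), (500, None), (None, "read")]
counts = Counter(classify(status, kind) for status, kind in results)
print(counts.most_common())
```

Counting by type rather than by a single pass/fail flag shows at a glance whether failures come from the application, an overloaded server, or the network.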

Error Rate Thresholds

What is an acceptable error rate? It depends on the test type and the operation. Financial transactions should have near-zero errors. Product browsing can tolerate a small percentage.

Context	Acceptable Error Rate	Why
Load test (expected users)	< 0.1%	Under normal load, the system should be reliable
Stress test (above expected)	< 5% at 2x load	Some degradation is expected beyond capacity
Financial transactions	< 0.01% (near zero)	Lost transactions mean lost money and compliance issues
Content browsing	< 1%	A failed page load is annoying but not catastrophic
File uploads	< 0.5%	Users will retry but too many failures erode trust
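The thresholds in the table can be written down as data and checked automatically at the end of a run. This is a sketch only: the dictionary keys and the function name are assumptions, and the numbers are copied from the table rather than being universal limits.

```python
# Maximum acceptable error rate in percent, copied from the table above.
THRESHOLDS_PCT = {
    "load_test": 0.1,
    "stress_test_2x": 5.0,
    "financial_transactions": 0.01,
    "content_browsing": 1.0,
    "file_uploads": 0.5,
}


def within_threshold(context: str, total: int, failed: int) -> bool:
    """True if the observed error rate is strictly below the threshold."""
    observed_pct = 100.0 * failed / total
    return observed_pct < THRESHOLDS_PCT[context]


# The same hypothetical result (50 failures in 100,000 requests, 0.05%)
# passes as a load test but fails the financial-transaction threshold.
print(within_threshold("load_test", 100_000, 50))
print(within_threshold("financial_transactions", 100_000, 50))
```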

Error Patterns to Watch

Sudden spike at a specific user count -- you hit the breaking point. Increase resources or optimize.
Gradual increase over time -- resource leak (connections, memory). Run a soak test to confirm.
Errors only on specific endpoints -- that endpoint has a specific bottleneck. Profile it.
Errors during ramp-up only -- cold start issues. Warm up caches and connection pools before accepting traffic.
Random intermittent errors at any load -- flaky dependency, network issues, or race conditions.
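The first two patterns can be told apart roughly from a series of error rates, one value per time interval or load step. The heuristic below is an illustrative assumption, not a standard algorithm: the 2-percentage-point cutoff and the 80% "mostly rising" rule are arbitrary starting values, and the other patterns need per-endpoint or per-phase data that this sketch does not look at.

```python
from typing import List


def describe_pattern(rates_pct: List[float], spike_jump_pct: float = 2.0) -> str:
    """Rough label for a series of per-interval error rates (percent).

    spike_jump_pct is an assumed cutoff: a rise of at least this many
    percentage points between consecutive intervals counts as a spike.
    """
    if len(rates_pct) < 2:
        return "not enough data"
    deltas = [b - a for a, b in zip(rates_pct, rates_pct[1:])]
    if max(deltas) >= spike_jump_pct:
        return "sudden spike"
    rising = sum(1 for d in deltas if d > 0)
    if rates_pct[-1] > rates_pct[0] and rising >= 0.8 * len(deltas):
        return "gradual increase"
    return "no clear trend"


print(describe_pattern([0.0, 0.1, 0.1, 6.0, 7.5]))  # sudden spike
print(describe_pattern([0.1, 0.2, 0.3, 0.5, 0.8]))  # gradual increase
```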

If your performance test shows 429 (Too Many Requests) errors, stop: the target system, or something in front of it, is rate limiting your test. Results from a rate-limited run are misleading because you are measuring the rate limiter, not the application. Either disable rate limiting in the test environment or adjust your test to stay within the limits.
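A simple guard is to compute the share of 429 responses before analyzing anything else. The sketch assumes the status codes of all responses are available as a list; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def rate_limited_share(status_codes: Iterable[int]) -> float:
    """Fraction of responses (0.0 to 1.0) that were HTTP 429."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return counts[429] / total if total else 0.0


# Hypothetical run: 90 successful responses and 10 rate-limited ones.
codes = [200] * 90 + [429] * 10
if rate_limited_share(codes) > 0:
    print("429s present: fix the test setup before analyzing results")
```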

Q: What error rate is acceptable in performance testing and how do you investigate errors?

A: Under expected load, error rate should be below 0.1% -- near zero. During stress testing, up to 5% at 2x expected load is acceptable. For financial transactions, near zero is mandatory. When I see errors, I categorize them: 500 errors indicate application crashes (check logs for exceptions), 503 means server overload (check worker/thread pool metrics), 504 means upstream timeout (identify slow dependency), 429 means rate limiting (adjust test or environment). I also look at error patterns -- sudden spike at a user count means capacity limit; gradual increase means resource leak; errors on specific endpoints means targeted bottleneck.

Key Point: Error rate should be near zero (< 0.1%) under expected load. Categorize errors by HTTP code, analyze patterns (sudden vs. gradual vs. endpoint-specific) to find root causes, and distinguish between application bugs and capacity limits.
