Root Cause Analysis Workflow

You have all the pieces now -- JMeter/Gatling reports, server metrics, APM traces, profiling data. But how do you put them together systematically? Without a structured approach, performance analysis becomes random poking around. You check CPU, then memory, then the database, then the network, then back to CPU -- going in circles. A systematic root cause analysis workflow saves time and prevents you from chasing ghosts. Think of it like a doctor diagnosing a patient: they do not randomly order MRIs and blood tests -- they follow a decision tree based on symptoms.

The Performance Triage Workflow

Systematic Root Cause Analysis

START WITH THE TEST REPORT: What does the data say? High response times? Low throughput? High errors? Identify the primary symptom.

NARROW TO THE ENDPOINT: Which specific request/transaction is the bottleneck? Use the Aggregate Report or APM transaction list.

DETERMINE THE TYPE: Is it latency (slow responses), throughput (low capacity), or errors (failures)? Each has a different investigation path.

CHECK THE APPLICATION SERVER: CPU, memory, GC, thread count. Is the app server itself the bottleneck?

CHECK THE DATABASE: Query performance, connection pool, lock contention, replication lag. Is the DB the bottleneck?

CHECK EXTERNAL DEPENDENCIES: API calls to third-party services, message queues, caches. Is a dependency the bottleneck?

CHECK THE NETWORK: Latency between components, bandwidth utilization, DNS resolution. Is the network the bottleneck?

CHECK THE LOAD GENERATOR: CPU, memory, network on your test machine. Is the test itself flawed?

CORRELATE AND CONFIRM: Match the timing of the bottleneck event with the symptom in the test report. Do they line up?

DOCUMENT THE ROOT CAUSE: Write down what you found, with evidence (metrics, timestamps, logs).
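
As a rough illustration of the first and third steps (identify the primary symptom, then determine its type), the check can be sketched in Python. This is only a sketch under stated assumptions: `LoadTestSummary`, `primary_symptom`, and the limit parameters are hypothetical names, and the limits themselves come from your own SLA, not from any tool.

```python
from dataclasses import dataclass


@dataclass
class LoadTestSummary:
    """Headline numbers from a load test report (hypothetical structure)."""
    p95_ms: float           # 95th percentile response time
    error_rate_pct: float   # failed requests as a percentage
    throughput_rps: float   # achieved requests per second


def _ratio(numerator, denominator):
    """Divide, treating a zero denominator as an unbounded breach."""
    return float("inf") if denominator == 0 else numerator / denominator


def primary_symptom(summary, sla_p95_ms, max_error_pct, target_rps):
    """Return the breached symptom types, worst first.

    "Worst" is the ratio of the observed value to its limit, so a p95 at
    3x the SLA outranks an error rate at 1.5x its budget. The limits are
    assumptions supplied by the caller.
    """
    breaches = []
    if summary.error_rate_pct > max_error_pct:
        breaches.append(("errors", _ratio(summary.error_rate_pct, max_error_pct)))
    if summary.p95_ms > sla_p95_ms:
        breaches.append(("latency", _ratio(summary.p95_ms, sla_p95_ms)))
    if summary.throughput_rps < target_rps:
        breaches.append(("throughput", _ratio(target_rps, summary.throughput_rps)))
    breaches.sort(key=lambda b: b[1], reverse=True)
    return [name for name, _ in breaches]
```

The first entry in the returned list is the symptom to investigate first. An empty list means no limit was breached.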

The Decision Tree

Instead of checking everything, use this decision tree to narrow down quickly. Start with the most common cause and work outward.

Bottleneck Decision Tree

Response time high?

Start here. If no, check errors.

↓

App server CPU > 80%?

Yes → CPU bottleneck. Profile the code.

↓

App CPU low, DB CPU high?

Yes → Slow queries. Check slow query log.

↓

Both CPUs low?

Waiting for something. Check connection pools, locks, external APIs.

↓

Memory growing over time?

Yes → Memory leak. Capture heap dump.

↓

All endpoints equally slow?

Network or infrastructure issue. Check latency, bandwidth.
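
The decision tree above can be sketched as a small Python function. Treat it as one possible encoding, not a definitive one: the function name `triage` and its parameters are hypothetical, and reusing the 80% CPU threshold for the database is an assumption (the tree only says "DB CPU high").

```python
def triage(response_time_high, app_cpu_pct, db_cpu_pct,
           memory_growing, all_endpoints_slow, cpu_high_pct=80.0):
    """Walk the bottleneck decision tree and return the leads to follow.

    cpu_high_pct applies to both servers; using it for the database is an
    assumption, since the tree does not give a number for "DB CPU high".
    """
    if not response_time_high:
        return ["Response time OK: check errors instead."]
    if app_cpu_pct > cpu_high_pct:
        return ["CPU bottleneck: profile the code."]
    if db_cpu_pct > cpu_high_pct:
        return ["Slow queries: check the slow query log."]

    # Both CPUs are low, so the application is waiting on something.
    leads = ["Waiting: check connection pools, locks, external APIs."]
    if memory_growing:
        leads.append("Memory leak: capture a heap dump.")
    if all_endpoints_slow:
        leads.append("Network or infrastructure issue: check latency, bandwidth.")
    return leads


# App server CPU at 45% and DB CPU at 92% points at the database.
print(triage(True, app_cpu_pct=45, db_cpu_pct=92,
             memory_growing=False, all_endpoints_slow=False))
```

Returning a list rather than a single answer reflects that the lower branches of the tree are not mutually exclusive: low CPU on both servers can coincide with growing memory or a network problem.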

Correlation Between Metrics -- Connecting the Dots

The most powerful analysis technique is correlation -- looking at two metrics that changed at the same time. If response times spiked at 14:32 and GC pauses started at 14:32, those events are likely related. If throughput dropped at 14:45 and database connection pool utilization hit 100% at 14:45, you have found your bottleneck. The key is timestamps -- align all your data sources to the same clock.

If You See This...	And Also This...	The Bottleneck Is Probably...
Response times rising linearly	App server CPU > 85%	CPU-bound code. Profile and optimize.
Response times rising linearly	App CPU low, DB CPU high	Slow database queries. Check slow query log.
Periodic response time spikes	GC pause duration spikes at same times	Garbage collection. Tune JVM heap or fix memory leak.
Sudden response time jump	Connection pool active = max at same time	Connection pool exhaustion. Increase pool or fix leak.
Response times degrading over hours	Memory usage steadily climbing	Memory leak. Heap dump analysis needed.
All endpoints slow simultaneously	Network latency between app and DB increased	Network issue. Check infrastructure, routing.
Throughput dropped but response times OK	Error rate increased at same time	Server rejecting requests. Check rate limiters, circuit breakers.
Throughput dropped, response times spiked	Disk I/O wait % high	Disk bottleneck. Check logging volume, DB writes.
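
Timestamp correlation can also be automated in a simple way. The sketch below, assuming every data source has already been exported as time-sorted `(datetime, value)` samples on the same clock, lists the metrics whose first threshold breach falls close to the symptom time. The names `first_breach` and `correlated_metrics`, the one-minute window, and all sample values are illustrative assumptions.

```python
from datetime import datetime, timedelta


def first_breach(samples, threshold):
    """Return the timestamp of the first sample at or above threshold, or None."""
    for ts, value in samples:
        if value >= threshold:
            return ts
    return None


def correlated_metrics(symptom_time, metrics, window=timedelta(minutes=1)):
    """List (name, breach_time) for metrics that first breach near the symptom.

    `metrics` maps a metric name to (samples, threshold). Samples must be
    sorted by time and recorded against the same clock as symptom_time.
    """
    hits = []
    for name, (samples, threshold) in metrics.items():
        breach = first_breach(samples, threshold)
        if breach is not None and abs(breach - symptom_time) <= window:
            hits.append((name, breach))
    return sorted(hits, key=lambda h: h[1])


def at(hhmm):
    """Build a timestamp on an arbitrary illustrative date."""
    return datetime.strptime("2024-01-01 " + hhmm, "%Y-%m-%d %H:%M")


# Made-up samples: GC pauses jump at 14:32, the DB pool fills at 14:45.
metrics = {
    "gc_pause_ms": ([(at("14:30"), 40), (at("14:32"), 900)], 500),
    "db_pool_utilization_pct": ([(at("14:30"), 55), (at("14:45"), 100)], 100),
}
print(correlated_metrics(at("14:32"), metrics))  # only gc_pause_ms matches
```

A match here is a lead, not proof. Confirm it against the table above and against logs or traces before documenting it as the root cause.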

Real-World Analysis Example

Let us walk through a realistic example end-to-end. You ran a load test on an e-commerce application: 500 users, 30-minute duration, ramp-up over 20 minutes.

Investigation Walkthrough

STEP 1: JMeter Report Dashboard
───────────────────────────────
Total requests: 245,000
Error rate: 3.2%
Avg response time: 1,850ms
p95 response time: 6,200ms
→ Verdict: Bad. Error rate too high, p95 way above SLA of 2 seconds.

STEP 2: Aggregate Report (sorted by p95 response time)
───────────────────────────────────────────────────────
/api/checkout    p95=12,400ms   errors=8.1%    ← WORST
/api/search      p95=4,800ms    errors=1.2%
/api/cart        p95=1,200ms    errors=0.3%
/api/products    p95=380ms      errors=0.0%
→ Verdict: Checkout is the bottleneck. Investigate this endpoint.

STEP 3: Response Times Over Time
────────────────────────────────
Checkout response times were fine for first 8 minutes (200ms avg),
then started climbing at minute 8 (200 concurrent users),
reached 5 seconds by minute 15 (400 users),
and 12+ seconds by minute 20 (500 users).
→ Verdict: Degradation started at ~200 users. Something saturated.

STEP 4: Server Metrics at Minute 8
──────────────────────────────────
App server CPU: 45%    ← Not the bottleneck
App server memory: 68% ← Fine
DB server CPU: 92%     ← RED FLAG!
DB connections active: 18 of 20 ← Almost full
→ Verdict: Database is the bottleneck.

STEP 5: Slow Query Log (queries during minute 8-30)
───────────────────────────────────────────────────
Query: SELECT * FROM inventory WHERE product_id = ? FOR UPDATE
Avg time: 2,400ms | Calls: 12,500 | Total time: 30,000s
Explain plan: Full table scan on inventory (500,000 rows)
→ ROOT CAUSE: Missing index on inventory.product_id,
   combined with row-level locking (FOR UPDATE) causing
   massive lock contention under concurrent checkout load.

STEP 6: Recommended Fix
───────────────────────
1. Add index: CREATE INDEX idx_inventory_product ON inventory(product_id)
2. Reduce lock scope: Lock only the specific row, not the scan range
3. Increase connection pool to 30 (short-term)
4. Re-test after fix to verify improvement
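
Two of the steps in this walkthrough reduce to simple arithmetic, sketched below in Python with the numbers from the example. The function names are hypothetical, and applying the 2-second SLA to each endpoint individually is an assumption.

```python
def rank_by_p95(rows, sla_p95_ms):
    """Return the endpoints whose p95 exceeds the SLA, slowest first (Step 2)."""
    over = [r for r in rows if r["p95_ms"] > sla_p95_ms]
    return sorted(over, key=lambda r: r["p95_ms"], reverse=True)


def total_query_time_s(avg_ms, calls):
    """Total time spent in a query, in seconds (Step 5)."""
    return avg_ms * calls / 1000


# Aggregate Report rows from Step 2.
aggregate_report = [
    {"path": "/api/products", "p95_ms": 380, "error_pct": 0.0},
    {"path": "/api/cart", "p95_ms": 1200, "error_pct": 0.3},
    {"path": "/api/checkout", "p95_ms": 12400, "error_pct": 8.1},
    {"path": "/api/search", "p95_ms": 4800, "error_pct": 1.2},
]

for row in rank_by_p95(aggregate_report, sla_p95_ms=2000):
    print(row["path"], row["p95_ms"])
# /api/checkout 12400
# /api/search 4800

# Slow query log entry from Step 5: 2,400 ms average over 12,500 calls.
print(total_query_time_s(2400, 12500))  # 30000.0
```

Ranking by total query time rather than average time is useful in the slow query log: a moderately slow query called thousands of times can cost more than a very slow query called once.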

Q: Walk me through your approach to root cause analysis after a performance test.

A: I follow a top-down approach. First, I look at the test report summary -- error rate, response times, throughput -- to understand the severity. Then I drill into the Aggregate Report to identify which specific endpoints are problematic. I sort by p95 response time rather than average because averages hide tail latency. Once I identify the slow endpoints, I look at the Response Times Over Time chart to determine when the degradation started and correlate that with the user count. Then I move to server-side investigation: I check app server CPU, memory, and GC activity. If the app server is healthy, I check the database -- slow query logs, connection pool utilization, and lock contention. If an APM tool is available, I look at the trace waterfall for the slow endpoint to see exactly which operation takes the most time. The key principle is correlation -- I align timestamps across all data sources. If response times spiked at 14:32 and the database connection pool hit 100% at 14:32, that is my most likely root cause, and I confirm it in the logs or traces before reporting it. I document the finding with evidence: the metric, the timestamp, the threshold crossed, and the recommended fix.
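
To see why sorting by p95 matters, here is a minimal Python sketch with made-up latencies. The `percentile` helper uses the nearest-rank method, which is one common definition; JMeter and other tools may interpolate differently, so exact values can vary slightly between tools.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative sample: 90 fast requests and 10 very slow ones.
latencies_ms = [200] * 90 + [8000] * 10

print(statistics.mean(latencies_ms))   # 980: looks acceptable
print(percentile(latencies_ms, 50))    # 200: the typical request is fast
print(percentile(latencies_ms, 95))    # 8000: one in ten users waits 8 seconds
```

The average sits under one second even though a tenth of the requests take eight seconds, which is exactly the tail that p95 exposes.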

Key Point: Root cause analysis is about correlation: find the symptom in the test report, match the timestamp to a server-side metric crossing a threshold, and trace it to the specific component (code, query, connection pool, network) that caused it.

Key Point: Follow a systematic workflow: start with the test report, narrow to the endpoint, check app server, database, dependencies, and network in order. Correlation across timestamps is the key technique.
