Chapter 8: Analyzing Results and Bottleneck Identification
So far we have been doing detective work with server metrics and log files. That is like solving a crime by interviewing witnesses and checking security cameras. APM (Application Performance Monitoring) tools are like having a body cam on every suspect -- they instrument your application code and trace every request from entry to exit, showing exactly where time is spent. If you have ever wished you could see inside a request and know "200ms was spent in the database query, 50ms in JSON serialization, and 800ms waiting for the payment gateway", APM tools do exactly that.
An APM tool instruments your application (usually via an agent/library you add to the runtime) and collects detailed traces. Each trace represents a single request flowing through your system, broken into "spans" -- one span for the controller method, one for the database query, one for the external HTTP call, and so on.
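To make the trace/span idea concrete, here is a minimal sketch of how a trace breaks down into time per span. The `Span` class, the span names, and the durations are hypothetical (the numbers echo the example above). The sketch also assumes child spans run one after another, so a parent's own time is its duration minus its children.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Return {span name: self time}, i.e. duration minus direct children.

    Assumes children do not overlap in time (no parallel calls).
    """
    child_total: Dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace for one request: a controller span with three children.
trace = [
    Span("1", None, "POST /orders (controller)", 1100.0),
    Span("2", "1", "SELECT orders (database)", 200.0),
    Span("3", "1", "JSON serialization", 50.0),
    Span("4", "1", "POST payment gateway (HTTP)", 800.0),
]

breakdown = self_times(trace)
# Print the spans from slowest to fastest, like a simplified trace waterfall.
for name, ms in sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s} {ms:7.1f} ms")
```

Real APM tools record start and end timestamps per span and handle overlapping children, but the principle is the same: the span with the largest self time is where the request actually waited.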
| APM Tool | Best For | Key Feature | Pricing Model |
|---|---|---|---|
| New Relic | Full-stack visibility, enterprise teams | Distributed tracing + infrastructure monitoring in one UI | Per-user plus per-GB ingested, free tier available |
| Datadog | Cloud-native apps, DevOps teams | Excellent dashboards, broad integration ecosystem | Per-host + per-feature pricing; free trial available, check current plans for APM coverage |
| Grafana + Prometheus + Tempo | Open-source enthusiasts, cost-conscious teams | Fully open-source stack, self-hosted, no vendor lock-in | Free (self-hosted). Grafana Cloud has paid tiers. |
| Dynatrace | Large enterprises, auto-discovery | AI-powered root cause analysis (Davis AI) | Per-host, typically expensive |
| Jaeger | Distributed tracing only | Open-source, CNCF project, Kubernetes-native | Free (self-hosted) |
| AppDynamics | Enterprise Java/.NET shops | Business transaction mapping, auto-baseline | Per-agent, enterprise pricing |
The real power of APM emerges when you combine it with load testing. Run your JMeter/Gatling test, then open your APM tool and look at the same time window. You will see things that server metrics alone cannot reveal: which specific code methods are slow, which database queries are called most often, which external services add latency, and whether the problem is in your code or in a dependency.
1. Ensure the APM agent is installed and configured on the application server. Verify it is reporting data.
2. Note the exact start time of your load test. You will use this to filter the APM data.
3. Run your load test as normal (JMeter/Gatling).
4. After the test, open the APM tool and set the time window to match your test duration.
5. Go to the "Transactions" or "Services" view. Sort by response time or error rate.
6. Click on the slowest transaction. Look at the trace breakdown -- which spans took the most time?
7. Check the "Database" view for slow queries. The APM will show the actual SQL, execution time, and call count.
8. Check the "External Services" view to see if any downstream API calls are slow.
9. Correlate APM findings with your JMeter/Gatling report. The slow endpoints should match.
10. Export or screenshot the trace waterfall for your report to stakeholders.
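The time-window and correlation steps above can be sketched as a small script that reads a JMeter CSV results file and reports the test window plus the p95 per label, which you can then compare against the APM transaction list. This is only a sketch under assumptions: the sample data is made up, the column names follow JMeter's usual CSV header (`timeStamp`, `elapsed`, `label`), and `timeStamp` is assumed to be the sample start in epoch milliseconds. Confirm both against your own JMeter configuration.

```python
import csv
import io
import math
from datetime import datetime, timezone

# Made-up sample of a JMeter CSV results file (JTL), trimmed to a few columns.
SAMPLE_JTL = """timeStamp,elapsed,label,responseCode,success
1700000000000,120,GET /products,200,true
1700000001000,150,GET /products,200,true
1700000002000,900,POST /orders,200,true
1700000003000,8000,POST /orders,500,false
1700000004000,1100,POST /orders,200,true
"""


def summarize(jtl_text):
    """Return ((start_ms, end_ms), {label: p95 elapsed in ms})."""
    rows = list(csv.DictReader(io.StringIO(jtl_text)))
    starts = [int(r["timeStamp"]) for r in rows]
    ends = [int(r["timeStamp"]) + int(r["elapsed"]) for r in rows]
    window = (min(starts), max(ends))

    by_label = {}
    for r in rows:
        by_label.setdefault(r["label"], []).append(int(r["elapsed"]))

    p95 = {}
    for label, values in by_label.items():
        values.sort()
        rank = math.ceil(0.95 * len(values))  # nearest-rank percentile
        p95[label] = values[rank - 1]
    return window, p95


window, p95 = summarize(SAMPLE_JTL)
start, end = (datetime.fromtimestamp(ms / 1000, tz=timezone.utc) for ms in window)
print("Set the APM time window to:", start.isoformat(), "->", end.isoformat())
for label, value in sorted(p95.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:16s} p95 = {value} ms")
```

If the label with the worst p95 here is not among the slowest transactions in the APM view for the same window, check for a clock or time-zone mismatch before trusting either side.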
If your team uses the open-source stack (Grafana + Prometheus + Tempo), you can build custom dashboards that combine your load test metrics with server metrics in one view. With the dashboard open during a test, you can watch your server metrics react to the load test traffic in real time.
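The same "watch it react" idea can be applied after the fact to data pulled from Prometheus. The sketch below scans a series for the first sample that crosses a threshold. It assumes the input has the shape Prometheus returns for a range query (a list of `[unix_timestamp, "value-as-string"]` pairs); the `first_breach` helper and the CPU numbers are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def first_breach(values: Sequence[Tuple[float, str]], threshold: float) -> Optional[float]:
    """Return the timestamp of the first sample at or above threshold, else None."""
    for ts, raw in values:
        if float(raw) >= threshold:  # Prometheus encodes sample values as strings
            return float(ts)
    return None


# Hypothetical CPU usage samples (percent), one every 15 seconds during a ramp-up.
cpu_percent = [
    (1700000000, "35.2"),
    (1700000015, "61.8"),
    (1700000030, "88.9"),
    (1700000045, "93.4"),
    (1700000060, "97.1"),
]

breach = first_breach(cpu_percent, 90.0)
print("CPU first reached 90% at unix time:", breach)
```

Matching that timestamp against the load test's ramp-up schedule tells you roughly how many virtual users were active when the threshold was crossed.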
# Panel 1: Application Response Time (from APM/Prometheus)
# PromQL: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Panel 2: Request Rate (throughput)
# PromQL: sum(rate(http_requests_total[5m]))
# Panel 3: Error Rate
# PromQL: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Panel 4: CPU Usage
# PromQL: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Panel 5: Memory Usage
# PromQL: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Panel 6: DB Connection Pool (if using HikariCP with micrometer)
# PromQL: hikaricp_connections_active
# Panel 7: JVM Heap Usage
# PromQL: sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) * 100
# Panel 8: GC Pause Duration
# PromQL (average pause): rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])
# Arrange these 8 panels in a 4x2 grid. During the load test,
# have this dashboard open on a second monitor. You will see
# the exact moment the system starts struggling.
Profiling is the nuclear option of performance analysis. An instrumenting profiler hooks every method call in your application and shows exactly where CPU time is spent. That instrumentation has significant overhead (10-30% performance impact), so you do not run it in production. But on a staging environment during a load test, it is invaluable. A flame graph captured from a profiler during a load test is one of the most precise diagnostic tools in your arsenal.
If you only learn one profiling skill, learn to read flame graphs. A flame graph shows call-stack depth on the Y-axis; on the X-axis, the width of each box is proportional to the share of CPU samples in which that method appeared. The X-axis is not a timeline. Each box is a method, and the wider the box, the more CPU time it accounts for, including everything it calls. Look at the widest boxes near the top of the stack -- those are the methods where CPU time is actually spent. Boxes at the bottom are entry points and framework/library code you usually cannot change.
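To see where box widths come from, here is a minimal sketch that reads profiler output in the "folded stacks" text format (one line per unique stack: frames joined by `;`, then a sample count) and computes, for each method, its total samples (box width) and its self samples (time at the top of the stack). The stack data and method names are hypothetical, and merging frames by name across call paths is a simplification of what a real flame graph renderer does.

```python
from collections import Counter

# Hypothetical folded-stack output: "root;...;leaf <sample count>" per line.
FOLDED = """\
main;handleRequest;OrderService.place;OrderDao.findAll 60
main;handleRequest;OrderService.place;JsonMapper.write 25
main;handleRequest;AuthFilter.check 10
main;gcThread 5
"""


def frame_widths(folded_text):
    """Return (total samples per frame, self samples per frame, all samples)."""
    total = Counter()        # frame appears anywhere in the stack -> box width
    self_counts = Counter()  # frame is the leaf -> it was on CPU when sampled
    grand_total = 0
    for line in folded_text.strip().splitlines():
        stack, count = line.rsplit(" ", 1)
        samples = int(count)
        frames = stack.split(";")
        grand_total += samples
        for frame in set(frames):  # count a frame once per stack, even if recursive
            total[frame] += samples
        self_counts[frames[-1]] += samples
    return total, self_counts, grand_total


total, self_counts, grand_total = frame_widths(FOLDED)
for frame, samples in total.most_common():
    width = 100.0 * samples / grand_total
    own = 100.0 * self_counts[frame] / grand_total
    print(f"{frame:22s} width {width:5.1f}%   self {own:5.1f}%")
```

In this made-up profile, `handleRequest` is wide only because of what it calls; `OrderDao.findAll` is both wide and at the top of its stack, which makes it the first place to look.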
Q: What APM tools have you used, and how did they help you during performance testing?
A: I have worked with New Relic and the Grafana-Prometheus stack. During performance testing, I use APM tools to bridge the gap between "what is slow" (from JMeter reports) and "why is it slow" (from server-side visibility). For example, in one project, JMeter showed that the order placement API had a p99 of 8 seconds. The APM trace waterfall revealed that 6 of those 8 seconds were spent in a single database query that was doing a full table scan on a 10-million-row table. Without the APM trace, we would have spent days adding log statements and guessing. With Grafana, I build dashboards that combine load test metrics with server metrics so the team can watch the test in real time. This is especially useful during stress tests -- you can see the exact user count where CPU hits 90% or where the connection pool is exhausted.
Key Point: APM tools trace individual requests through your system, showing exactly where time is spent. Combine APM traces with load test results to go from "which endpoint is slow" to "which line of code or SQL query is slow" in minutes.
Key Point: APM tools trace requests through the system and show exactly where time is spent. Use them alongside load tests to bridge the gap between symptoms (slow responses) and root cause (specific code or queries).