RCA Deep-Dive

Finding latency bottlenecks with Jaeger: a real case study

Priya Nakashima
Abstract trace waterfall visualization with a highlighted bottleneck

This is the write-up of a specific latency investigation we ran on behalf of a growing logistics platform, Saltwhip Labs, using Devloom's trace correlation against their Jaeger backend. The investigation took about 40 minutes from alert to confirmed root cause. I'm writing it up because the actual debugging path — not the tidy post-incident version, but the real sequence of dead ends and pivots — is instructive in ways that abstract tracing tutorials aren't.

The setup: Saltwhip Labs runs a microservice architecture with about 70 services, all instrumented with OTEL and shipping traces to Jaeger via OTEL Collector. Their payments service handles authorization and capture. Someone on their platform team noticed during a routine capacity review that P99 latency for the /v1/authorize endpoint had climbed from around 180ms to 480ms over the previous three weeks. No alerts had fired — the alert threshold was at 2000ms, which is a story in itself. No errors. No obvious deploy correlation in the deployment timeline.

Starting with the wrong hypothesis

The first thing we did was wrong, and I think it's worth naming that explicitly. Because P99 latency had risen with no error increase, the initial hypothesis was database query degradation — specifically, a missing index on a table that had grown significantly. This is a reasonable hypothesis for this symptom profile. We spent about 15 minutes correlating trace spans for database calls in the payments service against the latency distribution, looking for a query that had gone from fast to slow.

The database call spans were fine. Median 12ms, P99 22ms, distribution tight. Nothing had changed. We had burned 15 minutes on a wrong hypothesis because the symptom (P99 latency regression, no errors) pattern-matched to a known class of problem that turned out not to apply here.

This is why starting from a Jaeger waterfall rather than a hypothesis is the right first move for latency investigations. The waterfall shows you the span breakdown for a specific trace instance, which tells you where the time actually went — not where you think it went based on what you've seen before.

Reading the trace waterfall

We pulled ten traces for /v1/authorize from the high-latency bucket (above 400ms P99) and opened them side by side. Jaeger's waterfall view renders each span as a horizontal bar, positioned on the time axis relative to the trace start. Spans are nested to show parent-child relationships.

The pattern across all ten traces was identical: the root span (POST /v1/authorize) at ~460ms total, most of it in two child spans. The first child — db.query payments.authorize_check — was 15ms. The second child — and this is where it got interesting — was labeled http.client risk-screening and was running at 380ms to 420ms consistently.

A 380ms synchronous HTTP call in the hot path of a payment authorization. That's the bottleneck.

The span tags told us the URL: an external risk-scoring API endpoint. This API was being called synchronously, inline, as part of every authorization request. If this API was slow, the authorization endpoint was slow by exactly that amount, with no circuit breaker and no timeout short enough to cap the damage.

The undocumented synchronous call

When we brought this finding to Saltwhip's platform team, the reaction was "that shouldn't be synchronous." The risk-scoring API call had been added eight weeks earlier as part of a fraud prevention initiative. The intent was to call it asynchronously and cache the result, using a default "allow" behavior for new sessions without a cached score. What had actually shipped was a synchronous call with a 30-second timeout that blocked the authorization path completely until it returned.

The eight-week lag between the code change and the latency discovery is explained by the risk-scoring API's own performance: for the first five weeks, it responded in under 30ms. Then the vendor upgraded their model, which added roughly 350ms to their inference time. That change propagated directly into the payments service P99.

No alerting caught this because: (1) the latency threshold was set high; (2) the authorization success rate didn't change; (3) there were no errors logged by either service. The only observable signal was the P99 latency drift, and the system wasn't instrumented to surface that as an alert at the right level.

What Jaeger showed vs. what it didn't show

The Jaeger trace was essential for identifying the slow span. Without it, the investigation would have been a series of hypotheses tested one at a time: is it the database? Is it a specific geographic region? Is it a particular payment method? With the trace waterfall, the bottleneck was visible in the first ten traces we examined.

What the trace didn't show: the business context. Knowing that http.client risk-screening was slow didn't tell us that this call was supposed to be async, or that there was a caching layer that had been bypassed. That required code review and conversation with the engineer who built the feature. Traces give you the timing truth; they don't explain the design intent gap.

This is the boundary of what distributed tracing solves on its own. It's excellent at "which span is slow and where in the call graph does it sit." It's not a substitute for understanding the service's intended behavior. For latency investigations, the trace is the fastest way to identify what's slow — but the fix requires someone who understands what the code was supposed to do.

The span tags that made this faster

The investigation was faster than it might have been because Saltwhip's OTEL instrumentation included useful span tags on their HTTP client spans: the full request URL, the response code, the connection pool status, and the SSL handshake duration. These were set up as part of their OTEL Collector pipeline using semantic conventions from the OTEL HTTP semconv spec.

Without the URL in the span tags, we would have known there was a slow HTTP call but not where it was going. We might have caught it in logs, but log-to-trace correlation across a 30-minute latency investigation window is slower than having the URL directly in the span. OTEL's HTTP semantic conventions exist specifically to make this kind of investigation possible — http.url, http.method, http.status_code, http.response_content_length as standard span attributes.

Teams that instrument with OTEL but strip the URL from spans for privacy or security reasons need a compensating mechanism. A hash of the host is enough to distinguish "this is calling vendor A" from "this is calling vendor B" without exposing the full endpoint. Completely stripping external URL information from traces removes the signal that makes cases like this tractable.

The fix and the prevention

The immediate fix was straightforward: add a 200ms timeout to the risk-scoring API call, add a circuit breaker to fail open (allow the transaction) when the API is unavailable, and log a warning when the fallback fires. Payments continued working; the authorization latency returned to baseline within the deployment window.

The more durable fix was a service-level objective on the risk-scoring span duration specifically. If that span exceeds 100ms for more than 1% of requests in a five-minute window, it fires an alert directly to the team that owns the vendor integration — not to the payments on-call. This is the right owner: the latency is caused by a vendor, and the owner of that vendor relationship should know first.

The investigation yielded a second insight: Saltwhip had no alerting on P99 latency regression below their error-threshold ceiling. Adding a P99 latency change alert — alerting on 20% regression week-over-week rather than on absolute thresholds — would have surfaced this three weeks earlier. Sometimes the most valuable finding from a trace investigation isn't the root cause of the specific incident. It's the gap in observability coverage that let the incident grow undetected.