RCA Deep-Dive

Why alert correlation is not root cause analysis

Marcus Webb

Spend enough time in incident retrospectives and a pattern emerges. Someone will say "the database alert fired right before the API timeout alert, so we investigated the database first." It sounds reasonable. It turns out to be wrong about 60% of the time.

Alert correlation — grouping or sequencing alerts that fired close together in time — is useful for reducing notification noise. It is not root cause analysis. Conflating the two is one of the most expensive habits in on-call work, and most observability platforms actively encourage it.

What correlation actually tells you

When your monitoring system clusters three alerts together because they all fired within a 90-second window, it's telling you about temporal proximity. That's it. Temporal proximity between alerts is evidence that something happened — not evidence of what caused it, or in what direction causality flows.
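To make the limitation concrete, here is a minimal sketch of time-window grouping. The alert names, timestamps, and the `group_by_time_window` helper are all hypothetical, and real correlation engines are more elaborate, but the shape of the output is the point: a set of alerts with no notion of which one caused which.

```python
from datetime import datetime, timedelta

def group_by_time_window(alerts, window_seconds=90):
    """Cluster alerts that fired within window_seconds of a cluster's first alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["fired_at"]):
        if groups:
            first_in_group = groups[-1][0]["fired_at"]
            if (alert["fired_at"] - first_in_group).total_seconds() <= window_seconds:
                groups[-1].append(alert)
                continue
        groups.append([alert])
    return groups

# Hypothetical alerts; names and times are made up for illustration.
t0 = datetime(2024, 1, 1, 2, 0, 0)
alerts = [
    {"name": "db_cpu_high", "fired_at": t0},
    {"name": "api_p99_latency", "fired_at": t0 + timedelta(seconds=40)},
    {"name": "payments_error_rate", "fired_at": t0 + timedelta(seconds=75)},
    {"name": "disk_usage_warning", "fired_at": t0 + timedelta(minutes=30)},
]

groups = group_by_time_window(alerts)
for group in groups:
    print([a["name"] for a in group])
```

The first three alerts land in one group and the fourth in its own. Nothing in either group encodes a cause, an effect, or a direction between the two.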

Consider a common scenario: you're running a set of microservices backed by a shared PostgreSQL instance. A bad deploy goes out to your order-processing service. That service starts hammering the database with unindexed queries due to a missing WHERE clause on a newly added code path. Within two minutes you get: a database CPU alert, a slow-query alert, an elevated error rate on the payments service (which shares read replicas), and a p99 latency alert on the checkout API.

Your correlation tool groups all four. The database alerts arrived first. Without additional context, a reasonable engineer on call starts investigating the database. They find nothing wrong with the database itself, because nothing is wrong with it. The load is real and the queries are legitimate from the database's perspective; the problem is entirely in the application layer. Forty minutes later, after restarting replicas and calling the DBA at 2am, someone finally diffs the recent deploys and finds the missing WHERE clause.

The correlation was accurate: all four alerts were causally related to the same event. But the grouping gave no indication of direction. It didn't tell you which service was the origin. It didn't tell you that a deploy caused it. It showed you the blast radius, not the epicenter.

The direction problem

Root cause analysis requires directionality. You need to know not just "these things happened together" but "this thing caused those things." That's a structurally different problem from correlation.

Directed causality in distributed systems comes from a few places: deployment events, configuration changes, dependency changes (a third-party API degrading), resource exhaustion that cascades, or traffic pattern shifts. Identifying which of these is at play requires cross-signal reasoning — connecting a Kubernetes deployment event timestamp to a spike in error rate to a change in trace span duration — in a way that preserves the direction of the causal arrow.
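One small piece of that cross-signal reasoning can be sketched as follows: given the moment a metric started to deviate, list the change events that preceded it. The event shape, the `changes_before_onset` helper, and the 15-minute lookback are assumptions made for illustration, not a description of any particular tool.

```python
from datetime import datetime, timedelta

def changes_before_onset(change_events, onset, lookback_minutes=15):
    """Return change events inside the lookback window before onset, most recent first."""
    window_start = onset - timedelta(minutes=lookback_minutes)
    candidates = [e for e in change_events if window_start <= e["at"] <= onset]
    return sorted(candidates, key=lambda e: e["at"], reverse=True)

# Hypothetical change history; services and times are made up for illustration.
onset = datetime(2024, 1, 1, 2, 0, 0)
change_events = [
    {"kind": "deploy", "service": "order-processing", "at": onset - timedelta(minutes=3)},
    {"kind": "config", "service": "checkout-api", "at": onset - timedelta(hours=6)},
    {"kind": "deploy", "service": "payments", "at": onset + timedelta(minutes=10)},
]

candidates = changes_before_onset(change_events, onset)
for event in candidates:
    print(event["kind"], event["service"])
```

Only the order-processing deploy survives the filter. The config change is too old to be a likely trigger, and the payments deploy happened after the onset, so it cannot be the cause. That ordering constraint is what keeps the causal arrow pointing one way.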

Correlation tools don't do this. They group by time proximity. Some newer tools add rules-based grouping ("alerts on the same host" or "alerts matching the same service tag"), which is better, but still doesn't tell you which direction the causal chain flows. Rules-based grouping is still correlation, just with more structure.

How platforms obscure this distinction

Most APM and observability vendors market "root cause analysis" as a feature when what they've built is sophisticated alert correlation. The marketing is understandable — "we cluster alerts for you" doesn't sell dashboards. But the semantic gap creates real operational problems.

We're not saying correlation is useless. Noise reduction has genuine value. If you're getting 200 alerts per incident and a correlation engine reduces that to 8 grouped clusters, that's meaningful time saved. The problem comes when engineers treat the grouped cluster as the answer rather than the starting point. Thinking "the platform told me these four alerts are related, so I should investigate them" is fine. The mistake is assuming the first alert in the group is the cause.

Some platforms have started adding "probable cause" labels to alert groups. These are typically derived from simple heuristics: the alert that fired earliest, the service with the highest error rate, or a rule someone wrote months ago. They tend to be right in the most obvious cases, which are also the cases where a competent engineer would have found root cause in five minutes anyway. For the actual hard incidents, they're noise with a confidence label attached.
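As an illustration, here is the "earliest alert" heuristic applied to the four alerts from the PostgreSQL scenario above. The `probable_cause_earliest` function, the service names, and the second offsets are hypothetical; only the ordering (database alerts first) comes from the scenario.

```python
def probable_cause_earliest(alert_group):
    """Naive heuristic: label the earliest alert in the group as the probable cause."""
    return min(alert_group, key=lambda a: a["fired_at_s"])

# The four alerts from the scenario, with made-up offsets in seconds.
alert_group = [
    {"name": "database_cpu", "service": "postgres", "fired_at_s": 0},
    {"name": "slow_query", "service": "postgres", "fired_at_s": 20},
    {"name": "payments_error_rate", "service": "payments", "fired_at_s": 70},
    {"name": "checkout_p99_latency", "service": "checkout-api", "fired_at_s": 110},
]

label = probable_cause_earliest(alert_group)
print(label["service"])
```

The heuristic names the database. The service that actually originated the incident, order-processing, never raised an alert of its own, so no ranking over the alert group could have selected it.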

What root cause analysis actually requires

Genuine RCA in a distributed system needs three things working together:

Change context. What changed before the incident? Deploys, config pushes, feature flag flips, certificate renewals, dependency version bumps. Without this, you're pattern-matching symptoms against symptoms.

Causal graph traversal. Given that service A is degraded and service B is downstream, which direction is the call flow? If A's latency increased after a deploy to A, and B is calling A, then the causal direction is A → B. You need the service dependency graph to reason about this correctly.

Evidence accumulation, not just signal presence. A root cause hypothesis should be supported by converging evidence from multiple signal types — traces showing which span is slow, metrics showing when the deviation started, logs surfacing the specific error, deployment history showing what changed at that timestamp. Each of these is independently necessary; none is sufficient alone.

Alert correlation gives you the third point in incomplete form: it surfaces signals. But it doesn't connect them to change context, and it doesn't traverse the dependency graph. It leaves the hardest parts of RCA entirely to the on-call engineer.
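The second and third requirements can be sketched together. The example below uses the A and B services from the traversal description, plus a hypothetical service C that calls B. The `origin_candidates` and `evidence_score` helpers and the evidence records are assumptions made for illustration.

```python
def origin_candidates(calls, degraded):
    """Degraded services that have no degraded dependency of their own.

    calls maps each service to the services it calls (caller -> callees).
    """
    return sorted(
        service for service in degraded
        if not any(dep in degraded for dep in calls.get(service, []))
    )

def evidence_score(service, evidence):
    """Count the distinct signal types that point at a service."""
    return len({e["signal"] for e in evidence if e["service"] == service})

# Hypothetical topology: C calls B, B calls A.
calls = {"C": ["B"], "B": ["A"], "A": []}
degraded = {"A", "B", "C"}

# Hypothetical evidence records, one per signal type observed for a service.
evidence = [
    {"service": "A", "signal": "trace"},
    {"service": "A", "signal": "metric"},
    {"service": "A", "signal": "deploy"},
    {"service": "B", "signal": "metric"},
]

candidates = origin_candidates(calls, degraded)
ranked = sorted(degraded, key=lambda s: (-evidence_score(s, evidence), s))
print(candidates, ranked)
```

Service A is the only origin candidate, and it is also the hypothesis with the most converging signal types. Note the assumption baked into the traversal: failures propagate from callee to caller. In the PostgreSQL scenario the load flowed the other way, from a caller into a shared dependency, so a traversal like this would need the change context (the deploy to order-processing) to avoid blaming the database.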

Why this matters operationally

We built Devloom specifically because we kept hitting this problem in practice. The observability data existed — traces in Jaeger, metrics in Prometheus, deployment events in the CI/CD pipeline — but there was no automated path from "these alerts fired" to "this deploy caused it, here's the affected call graph." Every incident required a human to manually connect the dots across four or five different UIs.

The average triage time in that model, for incidents that weren't immediately obvious, was 35-50 minutes from first alert to identified root cause. Most of that time wasn't data gathering — the data was there. It was the causal reasoning work: ruling out the database when the real problem was in application code, tracing which service was the origin of a cascading failure, correlating a deploy timestamp with the moment metrics degraded.

Alert correlation reduces alert volume. It does not reduce reasoning time. Those are different problems, and you need solutions for both.

A practical rubric for evaluating your tooling

Next time you're evaluating an observability platform's RCA claims, ask two questions: Does it connect alert onset to a specific change event (deploy, config, dependency)? And does it traverse the dependency graph to identify origin service vs. downstream affected services?

If the answer to both is no, what you have is alert correlation with an RCA brand. That's still valuable — just don't rely on it to do the part it can't do.

The engineers on your team who are good at incidents aren't good because they're fast at reading dashboards. They're good because they reason about causality quickly and correctly. The tooling should help with that reasoning, not substitute alert grouping for it.