Causal inference vs correlation in SRE workflows

Marcus Webb

"Correlation doesn't imply causation" is the kind of phrase that sounds obvious when you say it in a statistics class and becomes less obvious at 3am when your checkout service error rate just crossed 5% and you're staring at three Grafana panels trying to figure out which one caused which.

This isn't an abstract statistics problem for SRE teams. It has a direct operational translation: if you mistake correlation for causation during an incident, you roll back the wrong deploy, restart the wrong service, or apply a mitigation that doesn't address the actual failure and may make things worse. The epistemics of causation aren't just academically interesting — they determine whether your incident response is fast and correct or slow and self-defeating.

Why correlation misleads in distributed systems specifically

In a simple system, correlation and causation often align closely enough to be operationally equivalent. If a database falls over and your application's database query latency spikes, those are correlated, and the correlation points directly at the cause.

Distributed systems break this alignment for three reasons. First, shared infrastructure creates spurious correlation. Two completely independent services might both degrade at the same time because they share a network path, a load balancer, or a DNS resolver, not because one caused the other. Their metrics are correlated, but the correlation reflects the shared dependency and says nothing about a causal relationship between the two services.

Second, cascading failures create correlation chains where the origin is several hops back. Service A's latency spikes, which causes service B's timeout rate to increase, which causes service C's error rate to rise. A, B, and C are all correlated, but only A is the origin; B and C are downstream effects. The correlation tells you the blast radius; it doesn't identify the epicenter.

Third, confounders create apparent causal relationships that don't exist. A traffic spike causes both an increase in CPU on your order service and an increase in latency on your payment service. The CPU and latency are correlated. The CPU didn't cause the latency — both are effects of the traffic spike. If you investigate CPU as the cause of payment latency, you'll spend 20 minutes going in the wrong direction.

The operational framework for causal reasoning

Causal inference in production incidents doesn't require formal statistical methods. It requires applying three tests that are tractable under pressure.

Temporal precedence. A cause must precede its effect. If service B's latency degraded before service A's latency degraded, then A didn't cause B's latency — even if they're correlated and even if A is upstream of B in your dependency graph. Temporal ordering is the cheapest causal filter and should always be applied first. This is why precise timestamps on deployment events, configuration changes, and metric deviations matter: they let you establish what happened before what.
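
As a minimal sketch of the temporal filter, assuming change events are available as timestamped records in a single time zone: the `ChangeEvent` type, the event names, and the timestamps below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeEvent:
    name: str
    at: datetime  # when the change took effect


def could_precede(events, symptom_onset):
    """Keep only changes that took effect before the symptom began, oldest first."""
    earlier = [e for e in events if e.at < symptom_onset]
    return sorted(earlier, key=lambda e: e.at)


# Hypothetical incident timeline.
onset = datetime(2024, 3, 5, 14, 30)
events = [
    ChangeEvent("deploy search-service", datetime(2024, 3, 5, 14, 42)),
    ChangeEvent("config change: connection pool size", datetime(2024, 3, 5, 13, 55)),
    ChangeEvent("deploy order-service", datetime(2024, 3, 5, 14, 20)),
]

candidates = could_precede(events, onset)
for event in candidates:
    print(f"{event.at:%H:%M}  {event.name}")
```

The output is ordered chronologically rather than newest first on purpose. Passing this filter only means a change remains a candidate; it says nothing about which candidate is most likely.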

Mechanism plausibility. There needs to be a plausible mechanism by which the hypothesized cause would produce the observed effect. The claim "CPU utilization on order-service increased, therefore payment-service latency increased" only holds up if such a mechanism exists, for example if order-service sits downstream in the payment call chain so that its latency adds directly to payment response time. If there's no plausible mechanism, because the services are independent with no shared call paths, then the correlation is coincidental or the product of a common cause.

The mechanism check is where the dependency graph is valuable. Without knowing the actual call relationships between services, every temporal correlation is a candidate cause. With the dependency graph, you can immediately filter to the set of services that have a plausible mechanism to affect the observed symptom.
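
The filtering step can be sketched as a graph traversal. The call graph below is hypothetical, and `plausible_causes` is an illustrative helper, not a description of any particular tool: it returns the symptomatic service plus everything that service transitively calls.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "checkout-service": ["order-service", "payment-service"],
    "order-service": ["inventory-service"],
    "payment-service": [],
    "inventory-service": [],
    "search-service": ["inventory-service"],
}


def plausible_causes(calls, symptomatic):
    """Return the symptomatic service and every service on its call paths."""
    seen = {symptomatic}
    queue = deque([symptomatic])
    while queue:
        service = queue.popleft()
        for dependency in calls.get(service, []):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


candidates = plausible_causes(CALLS, "checkout-service")
print(sorted(candidates))
```

This covers only the call-path mechanism. Shared infrastructure would have to be modeled as extra nodes, and a caller that changes what it sends (a frontend, for instance) can also be a cause even though it sits above the symptomatic service in the graph.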

Counterfactual reasoning. Would the effect have occurred without the hypothesized cause? This is harder to evaluate during an incident but often tractable in retrospect. If the deploy you're considering rolling back went to a service that the affected user flow doesn't use, the deploy is probably not the cause regardless of timing correlation.

A worked example: the wrong rollback

A software company running an e-commerce platform notices elevated checkout failure rates at 22:15 on a Tuesday. The failure rate had started rising at 22:10. Two deploys had gone out that day: a frontend update to the checkout UI at 18:00 and an inventory-service update at 21:45.

The temporal correlation points at inventory-service (deployed 25 minutes before the failure rate started rising). The incident commander initiates a rollback of inventory-service. The failure rate doesn't improve.

Forty minutes later, after reviewing the traces, the team finds that checkout failures are concentrated on orders with more than 10 line items. The 18:00 frontend update changed how the checkout UI batches line item requests, and batches over 10 items now trigger a code path that has a regression. The inventory-service rollback was a false attribution driven by timing alone. Both deploys preceded the symptom, so temporal precedence could not separate them; the more recent one was simply assumed to be the cause.

The mechanism question would have helped here: does the inventory-service update have a plausible mechanism to cause failures only on multi-line-item orders? Almost certainly not. Inventory checks are per-item and don't behave differently based on order size. Asking the mechanism question before rolling back would have focused the investigation on the correct code path.
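
The pattern the team eventually found in the traces, failures concentrated on large orders, is the kind of thing a quick segment breakdown can surface. The sketch below is illustrative only: the attempt records are invented, and the threshold of 10 is taken from the example above.

```python
from collections import defaultdict

# Invented checkout attempts: (line_item_count, failed).
attempts = [
    (2, False), (1, False), (12, True), (3, False), (15, True),
    (11, True), (4, False), (10, False), (14, False), (6, True),
]


def failure_rate_by_segment(attempts, threshold=10):
    """Split attempts at the line-item threshold and return each segment's failure rate."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [failures, attempts]
    for line_items, failed in attempts:
        segment = "over" if line_items > threshold else "at_or_under"
        totals[segment][0] += int(failed)
        totals[segment][1] += 1
    return {seg: failures / count for seg, (failures, count) in totals.items()}


rates = failure_rate_by_segment(attempts)
for segment, rate in sorted(rates.items()):
    print(f"{segment}: {rate:.0%} failed")
```

A failure rate that differs sharply between segments narrows the mechanism question: any candidate cause has to explain why it would affect one segment and not the other.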

What causal inference looks like in practice at Devloom

The formal methods for causal inference — do-calculus, counterfactual analysis, structural causal models — are well-established in statistics research. Applying them at incident speed requires operationalizing the core ideas without the mathematical apparatus.

Our approach is to ask three questions of each candidate cause. Temporal precedence: did it happen before the symptom onset? Mechanism: does it have a plausible path to the observed service? Directionality: is the correlation consistent with this being the origin, or is it just a downstream effect? We answer these using deployment event timestamps, the live dependency graph, and trace data showing which span in the call chain is anomalous.

The trace data is where directionality evidence lives. If traces show that order-service is the span with anomalous latency and all of checkout-service's latency increase is explained by the time spent waiting for order-service calls, that's strong directionality evidence: the cause is in order-service, and checkout-service is a downstream victim. Without the trace-level evidence, you'd have two correlated services and no directionality signal.
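
One way to sketch the directionality check is to compare each service's self time (span duration minus time spent waiting on child calls) between a baseline trace and an incident trace. The `Span` type and the durations below are hypothetical, and the sketch assumes child calls run sequentially; with parallel fan-out the subtraction would need to account for overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def self_time_by_service(span, acc=None):
    """Sum each service's self time: span duration minus time spent in child spans."""
    acc = {} if acc is None else acc
    child_total = sum(child.duration_ms for child in span.children)
    acc[span.service] = acc.get(span.service, 0.0) + span.duration_ms - child_total
    for child in span.children:
        self_time_by_service(child, acc)
    return acc


# Hypothetical traces for the same request path, before and during the incident.
baseline = Span("checkout-service", 120, [
    Span("order-service", 80, [Span("inventory-service", 30)]),
])
incident = Span("checkout-service", 920, [
    Span("order-service", 880, [Span("inventory-service", 30)]),
])

before = self_time_by_service(baseline)
during = self_time_by_service(incident)
delta = {service: during[service] - before[service] for service in during}
origin = max(delta, key=delta.get)

print(delta)
print("largest self-time increase:", origin)
```

In this made-up trace, checkout-service got slower end to end, but its self time did not change. All of the added time sits in order-service, which is the downstream-victim pattern described above.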

Building the habit

Causal reasoning is a skill that degrades under pressure unless it's been practiced as a habit. The engineers who are fastest at incident resolution aren't necessarily smarter — they're the ones who have internalized a causal reasoning framework deeply enough that they apply it automatically rather than falling back to "most recent deploy wins."

We're not saying every incident needs a formal causal analysis. For straightforward incidents where the evidence is clear, operational judgment is sufficient and faster. The framework matters for the incidents where evidence is ambiguous and the instinct to act on the most salient correlation would lead you wrong.

The asymmetry is worth noting: applying causal reasoning to a case where simple correlation would have also worked costs you maybe 2-3 minutes. Applying simple correlation reasoning to a case where causal inference was required costs you 30-60 minutes of wrong-tree investigation plus the additional user impact of a delayed resolution. That asymmetry is why causal reasoning is a professional skill worth developing, not just a statistical nicety.
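
The asymmetry can be turned into a rough break-even figure. The sketch below uses the minute ranges quoted above and adds one simplifying assumption that is not in the text: that doing the causal check always avoids the wrong-tree investigation. Under that assumption, the check pays off on average once the share of misleading incidents exceeds the check cost divided by the wrong-path cost.

```python
def break_even_share(check_cost_min, wrong_path_cost_min):
    """Share of incidents that must be misleading for the check to pay off on average.

    Always checking costs check_cost_min per incident.
    Never checking costs share * wrong_path_cost_min per incident on average.
    The two are equal when share = check_cost_min / wrong_path_cost_min.
    """
    return check_cost_min / wrong_path_cost_min


# Ranges from the text: 2-3 minutes for the check, 30-60 minutes for the wrong path.
most_favorable = break_even_share(2, 60)
least_favorable = break_even_share(3, 30)

print(f"break-even share, most favorable:  {most_favorable:.1%}")
print(f"break-even share, least favorable: {least_favorable:.1%}")
```

Even at the least favorable end of those ranges, the check is worth doing if roughly one incident in ten would otherwise send the team in the wrong direction.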