Incident Engineering

Mean time to innocence: the real incident metric

Marcus Webb March 18, 2025


Everyone in SRE knows MTTR. Mean time to resolution is on every post-mortem template, every quarterly reliability review, every engineering dashboard that someone checks when things go wrong. It's a fine metric for tracking overall incident duration. It's a terrible metric for understanding where you're actually losing time during incidents.

The metric that actually describes the pain is MTTI: mean time to innocence. It's the time from first alert to the moment each team involved can definitively say "my service is not the root cause." It's the time your payment team spends investigating their service before ruling it out and passing the incident to the infrastructure team. It's the time your database team spends pulling query logs before confirming that query performance is fine and the problem is upstream. It's the time spent ruling things out rather than ruling things in.

In distributed systems with microservice architectures, MTTI is frequently where the majority of incident time goes. And almost no one tracks it.

Why MTTR hides the problem

MTTR is end-to-end: alert fires, incident is resolved, time elapsed. It tells you the headline number but nothing about the composition. A 45-minute MTTR might consist of: 5 minutes to acknowledge, 32 minutes of triage across three teams, and 8 minutes to apply the actual fix. Or it might be: 5 minutes to acknowledge, 3 minutes to identify root cause, and 37 minutes to coordinate the remediation and get the deployment through review.

These are completely different problems requiring completely different solutions. But MTTR treats them as equivalent.
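To make the difference concrete, here is a minimal sketch of the two 45-minute incidents above, broken into phases. The `IncidentPhases` class and its field names are illustrative assumptions, not part of any incident tool.

```python
from dataclasses import dataclass


@dataclass
class IncidentPhases:
    """Minutes spent in each phase of one incident (hypothetical breakdown)."""
    acknowledge: float
    triage: float
    remediate: float

    @property
    def mttr_minutes(self) -> float:
        # The headline number is just the sum of the phases.
        return self.acknowledge + self.triage + self.remediate

    def dominant_phase(self) -> str:
        # The phase that consumed the most time in this incident.
        phases = {
            "acknowledge": self.acknowledge,
            "triage": self.triage,
            "remediate": self.remediate,
        }
        return max(phases, key=phases.get)


# The two breakdowns described in the text.
triage_bound = IncidentPhases(acknowledge=5, triage=32, remediate=8)
remediation_bound = IncidentPhases(acknowledge=5, triage=3, remediate=37)

print(triage_bound.mttr_minutes, triage_bound.dominant_phase())
print(remediation_bound.mttr_minutes, remediation_bound.dominant_phase())
```

Both incidents report the same headline number, but the dominant phase differs, and that difference is exactly the information MTTR throws away.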

The reason MTTI matters is that in microservice environments, triage time — specifically, the time spent ruling out services that are symptomatic but not causal — tends to dominate. A degraded downstream service pages its owner. That owner investigates and eventually determines their service is healthy but dependent on an upstream that's slow. They hand off. The upstream owner investigates. And so on, until someone reaches the actual origin.

In a well-instrumented system with clear causal chains, this handoff should take minutes. In practice, without tooling that can traverse the dependency graph and identify the origin, each handoff involves independent investigation rather than directed investigation. The result is that three or four teams are all separately building the same picture that one team with the right context could have built immediately.

Measuring MTTI in practice

MTTI requires some definition work before you can measure it, because "innocence" isn't a natural event in most incident management systems. PagerDuty doesn't have an "I'm not the cause" button.

The practical proxy is team handoff timing: the interval between when a team receives an incident notification and when it transfers responsibility or hands off context to another team. That interval is the team's MTTI contribution. Summed across all teams involved in an incident, it gives you total triage time, which is a reasonable operational proxy for MTTI.
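A rough sketch of that calculation, assuming you can export handoff events as (team, notified, handed off) tuples. The record shape, team names, and timestamps here are hypothetical; a real export from your incident tool will look different.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical handoff log for one incident: (team, notified_at, handed_off_at).
handoffs = [
    ("payments", datetime(2025, 3, 1, 10, 0), datetime(2025, 3, 1, 10, 12)),
    ("orders", datetime(2025, 3, 1, 10, 12), datetime(2025, 3, 1, 10, 25)),
]


def triage_time_by_team(events):
    """Sum the notified-to-handoff interval for each team."""
    totals = defaultdict(timedelta)
    for team, notified_at, handed_off_at in events:
        if handed_off_at < notified_at:
            raise ValueError(f"handoff precedes notification for {team}")
        totals[team] += handed_off_at - notified_at
    return dict(totals)


per_team = triage_time_by_team(handoffs)
total_triage = sum(per_team.values(), timedelta())

for team, spent in per_team.items():
    print(team, spent)
print("total triage:", total_triage)
```

Note that the simple sum assumes teams investigate one after another. If two teams are paged in parallel, the sum measures total engineer time spent on triage rather than wall-clock time, which may or may not be the number you want.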

A simpler metric for small teams: just start tracking "time from alert to first confident hypothesis." Not confirmed root cause — just the first moment when someone can credibly say "I think this is service X, I'm going to focus there." The gap between alert and first confident hypothesis is often 15-25 minutes for non-trivial incidents, and it's almost entirely composed of time spent eliminating wrong hypotheses.
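If you log just two timestamps per incident, the calculation is a few lines. The records below are made up for illustration, and the tuple layout is an assumption rather than any tool's format.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (alert_fired_at, first_confident_hypothesis_at).
incidents = [
    (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 3, 9, 18)),
    (datetime(2025, 3, 7, 14, 30), datetime(2025, 3, 7, 14, 52)),
    (datetime(2025, 3, 11, 2, 5), datetime(2025, 3, 11, 2, 20)),
]


def minutes_to_hypothesis(records):
    """Gap in minutes between the alert and the first confident hypothesis."""
    return [(hypothesis - alert).total_seconds() / 60 for alert, hypothesis in records]


gaps = minutes_to_hypothesis(incidents)
print("per-incident gaps (min):", gaps)
print("median gap (min):", median(gaps))
```

A median is used here instead of a mean so that one unusually long incident doesn't swamp the number; either is a defensible choice for a small team.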

The asymmetry that makes this painful

There's an asymmetry in distributed system incidents that makes MTTI particularly insidious: downstream services fail visibly, and upstream services fail silently.

When your payments API is degraded because your order service is slow and your order service is slow because a third-party shipping rate API is timing out, the alert fires on payments. The payments team gets paged. They look at their service — error rates up, latency up, but nothing obviously broken in their code. They look at their dependencies. They find the order service call is slow. They page the order team. The order team looks at their service — also appears healthy, but one outbound API call is running slow. Another handoff.

By the time someone reaches the real cause, 25 minutes have elapsed. The actual problem, a third-party API timing out, took 3 minutes to fix once someone had a runbook entry for it. The MTTI was 25 minutes; the actual remediation was 3 minutes. MTTR for this incident is 28 minutes. That's a decent MTTR. The incident's duration was driven almost entirely by a bad MTTI, but MTTR gives you no visibility into that.

What compresses MTTI

There are three mechanisms that actually reduce time to innocence. Each addresses a different part of the problem.

Service dependency maps that are accurate and queryable. Not a stale architecture diagram in Confluence that hasn't been updated in six months. A live dependency map derived from actual trace data showing what calls what, with latency distributions. When an incident fires, you want to immediately see the propagation path — which service originates this call chain, which services are downstream consumers. This turns handoff from "I'm not sure who I should escalate to" into "the trace shows this originates in order-service, handing off now."
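As a sketch of what "queryable" can mean, the snippet below walks a call graph from the alerting service through its degraded dependencies and returns the deepest degraded ones. The `calls` map, the `degraded` set, and the service names are assumptions for illustration; in a real system both would be derived from trace data and latency baselines.

```python
# Hypothetical call edges derived from traces: caller -> services it calls.
calls = {
    "payments-api": ["order-service"],
    "order-service": ["shipping-rate-api", "inventory-db"],
    "shipping-rate-api": [],
    "inventory-db": [],
}

# Services currently outside their latency or error baseline (assumed input).
degraded = {"payments-api", "order-service", "shipping-rate-api"}


def likely_origins(alerting_service, calls, degraded):
    """Follow degraded dependencies from the alerting service.

    Returns degraded services that have no degraded dependency of their
    own, which makes them the best candidates for where the problem starts.
    """
    origins, seen, stack = [], set(), [alerting_service]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        degraded_deps = [dep for dep in calls.get(service, []) if dep in degraded]
        if service in degraded and not degraded_deps:
            origins.append(service)
        stack.extend(degraded_deps)
    return origins


print(likely_origins("payments-api", calls, degraded))
```

This only narrows the search; a degraded leaf is a candidate, not a verdict. But it replaces two sequential handoffs with one directed page.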

Deploy correlation on by default. When you get an alert, the first question should be automatic: did anything deploy in the last two hours in the affected service or its upstream dependencies? If the answer is yes and the timing correlates, you have a strong first hypothesis before anyone opens a dashboard. This single check — what changed recently near the blast radius — eliminates a large fraction of triage time for deployment-related incidents, which represent a substantial fraction of all incidents in environments with frequent deployment cycles.
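The check itself is simple enough to sketch in a few lines. The deploy log format, service names, and the `recent_deploys` helper are hypothetical; the two-hour window comes from the question above and should be tuned to your deployment cadence.

```python
from datetime import datetime, timedelta

# Hypothetical deploy log: (service, deployed_at).
deploys = [
    ("order-service", datetime(2025, 3, 1, 9, 20)),
    ("inventory-db", datetime(2025, 3, 1, 6, 0)),
    ("search-api", datetime(2025, 3, 1, 9, 45)),
]


def recent_deploys(alert_at, blast_radius, deploys, window=timedelta(hours=2)):
    """Deploys to services in the blast radius within `window` before the alert.

    Results are ordered most recent first.
    """
    hits = [
        (service, deployed_at)
        for service, deployed_at in deploys
        if service in blast_radius
        and timedelta(0) <= alert_at - deployed_at <= window
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


alert_at = datetime(2025, 3, 1, 10, 0)
# The affected service plus the dependencies it calls (assumed to be known).
blast_radius = {"payments-api", "order-service", "inventory-db"}

suspects = recent_deploys(alert_at, blast_radius, deploys)
print(suspects)
```

Here only the `order-service` deploy qualifies: `inventory-db` deployed four hours before the alert, and `search-api` is outside the blast radius.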

Span-level evidence attached to the alert. Not a link to Jaeger that says "go look." The relevant traces — specifically, the traces that are anomalous relative to baseline, in the affected service, at the time of the incident — surfaced in the initial context. The goal is that the engineer acknowledging the alert can see, before opening a single additional tool, which span is slow and what changed. That context compression is what reduces "let me go investigate" from a 20-minute exercise to a 3-minute exercise.
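One simple way to pick out "anomalous relative to baseline" is a z-score per span name, sketched below. This is a stand-in technique chosen for brevity, not a claim about how any tracing product does it, and the span names, durations, and threshold are all invented for the example.

```python
from statistics import mean, stdev

# Hypothetical baseline durations in milliseconds, keyed by span name.
baseline_ms = {
    "payments.charge": [40, 42, 38, 41, 39],
    "orders.get_quote": [120, 118, 125, 122, 119],
    "shipping.rate_lookup": [200, 210, 190, 205, 195],
}

# Durations observed for the same spans during the incident window.
incident_ms = {
    "payments.charge": 41,
    "orders.get_quote": 123,
    "shipping.rate_lookup": 4800,
}


def anomalous_spans(baseline, observed, z_threshold=3.0):
    """Flag spans whose duration is far above baseline, worst first."""
    flagged = []
    for name, duration in observed.items():
        samples = baseline.get(name, [])
        if len(samples) < 2:
            continue  # Not enough history to estimate spread.
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            continue  # Constant baseline; a z-score is undefined.
        z = (duration - mu) / sigma
        if z > z_threshold:
            flagged.append((name, round(z, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


flagged = anomalous_spans(baseline_ms, incident_ms)
print([name for name, _ in flagged])
```

Real latency distributions are long-tailed, so a percentile comparison would likely be more robust in production. The point is that the alert payload carries the flagged span rather than a link to go find it.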

MTTI as a reliability target, not just a diagnostic

Once you start measuring MTTI (even as a proxy via handoff timing), something useful happens: you can set reliability targets against it separately from MTTR. An MTTR target of 30 minutes with no MTTI target means you might be hitting 30-minute MTTR because you have excellent remediation automation but terrible triage, or vice versa. You don't know which investment to make next.

With MTTD (mean time to detect), MTTI, and post-MTTI remediation time tracked separately, you can identify where improvement will have the highest impact. For most teams with 50+ services and frequent deployment cycles, the bottleneck is in triage, and specifically in the ruling-out work that MTTI captures. That's where tooling investment returns the most.
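A sketch of what per-phase targets might look like in practice. The incident data, phase names, and target values are all hypothetical; the idea is only that each phase gets its own number to be compared against.

```python
from statistics import mean

# Hypothetical per-incident phase durations, in minutes.
incidents = [
    {"detect": 4, "triage": 25, "remediate": 3},
    {"detect": 6, "triage": 32, "remediate": 8},
    {"detect": 5, "triage": 3, "remediate": 37},
]

# Hypothetical per-phase targets, in minutes.
targets = {"detect": 5, "triage": 10, "remediate": 15}


def phase_means(incidents):
    """Mean duration of each phase across all incidents."""
    phases = incidents[0].keys()
    return {phase: mean(incident[phase] for incident in incidents) for phase in phases}


means = phase_means(incidents)

# How far each phase overshoots its target; phases within target are omitted.
overshoot = {
    phase: means[phase] - target
    for phase, target in targets.items()
    if means[phase] > target
}
worst_phase = max(overshoot, key=overshoot.get) if overshoot else None

print(means)
print(overshoot, worst_phase)
```

With these made-up numbers, triage overshoots its target by 10 minutes and remediation by 1, so triage is the obvious place to invest first. A single MTTR target could not have told you that.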

We're not saying MTTR is a bad metric. It's a good summary metric. The argument is that it's insufficient as a diagnostic — it tells you there's a problem without telling you which part of the incident process is the bottleneck. MTTI makes the triage phase visible and measurable, which is the first step to actually improving it.

Track the time to innocence. It's where the time goes.