RCA Deep-Dive

How to map blast radius in distributed systems

Marcus Webb June 3, 2025

Blast radius is one of those concepts that everyone nods along to but few teams have actually operationalized. The theory is clear: when a service fails or degrades, other services that depend on it will also be affected. The practice — knowing in real time, during an incident, exactly which services are currently impacted and to what degree — is substantially harder than it sounds.

In a system with 60 microservices and non-trivial interdependencies, a degradation in a shared internal service can propagate to a dozen downstream consumers before any of them fire an alert. If you're reacting to individual service alerts rather than understanding the propagation graph, you're always behind the incident.

What blast radius actually means operationally

The term comes from chaos engineering and failure mode analysis, but it has a precise operational meaning: given a degradation event at service X, the blast radius is the set of services whose reliability is currently degraded as a direct or transitive consequence of X's failure.

There are two dimensions worth separating. The first is the structural blast radius — which services have X in their call path at all, based on the dependency graph. The second is the live blast radius — which services are currently experiencing degradation above baseline that can be attributed to X's current state. Structural blast radius is static; live blast radius is dynamic and changes as services respond to the degradation (circuit breakers trip, load shedding kicks in, fallbacks engage).
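The difference between the two dimensions can be sketched in a few lines. This is a minimal illustration, not a production attribution method: the service names, the P99 figures, and the `tolerance` multiplier are all hypothetical, and "attributed to X" is approximated as "structurally dependent on X and currently above baseline".

```python
def live_blast_radius(structural, current_p99_ms, baseline_p99_ms, tolerance=1.5):
    """Subset of the structural blast radius that is degraded right now.

    A service counts as degraded when its current P99 latency exceeds
    `tolerance` times its baseline (an assumed, simplistic rule).
    """
    return {
        svc for svc in structural
        if current_p99_ms[svc] > tolerance * baseline_p99_ms[svc]
    }


# Hypothetical snapshot taken shortly after X starts degrading.
structural = {"search-api", "recommendation-engine", "checkout"}
baseline = {"search-api": 250, "recommendation-engine": 400, "checkout": 300}
current = {"search-api": 2900, "recommendation-engine": 410, "checkout": 320}

live = live_blast_radius(structural, current, baseline)
```

The structural set stays the same between snapshots; the live set is recomputed every time the latency figures change.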

Most post-mortems discuss blast radius retrospectively. What incident engineering actually requires is live blast radius information during the incident itself, before all affected services have alerted. If you know that order-service is degraded and you immediately know that checkout, notifications, and inventory are all downstream consumers with non-trivial dependency ratios, you can proactively engage those teams rather than waiting for their alerts to cascade.

Building the dependency graph

The dependency graph is the foundation. There are two ways to build it: statically (from service registry, API contracts, or infrastructure declarations) and dynamically (from trace data). Each has trade-offs.

Static dependency graphs are incomplete almost by definition in microservice environments where teams ship multiple times a day. A service that added a new downstream dependency last Tuesday may not have updated the architecture diagram. Static graphs are useful as a baseline and for documentation, but shouldn't be the sole source of truth for incident response.

Dynamic dependency graphs derived from distributed traces are more accurate because they reflect actual call relationships, not intended ones. If service A is making calls to service B that aren't in any documentation, the trace data shows it. The trace data also shows you call frequency, latency distributions, and error rates for each edge in the graph — all of which matter for estimating impact severity.

Building a live dependency graph from traces means aggregating span data across a time window (typically the last 24-72 hours to smooth out low-frequency calls) and constructing a directed graph where nodes are services and edges are caller-callee relationships weighted by call volume and error rate. This is the graph you need for blast radius traversal during an incident.
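A minimal sketch of that aggregation step follows. It assumes each span has already been reduced to a dict with the hypothetical keys `caller`, `callee`, `timestamp` (epoch seconds), and `error`; real tracing backends expose this information differently, so treat the field names as placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Edge:
    """Aggregated weight for one caller -> callee relationship."""
    calls: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0


def build_dependency_graph(spans, window_start, window_end):
    """Aggregate spans inside [window_start, window_end) into
    a nested mapping {caller: {callee: Edge}}."""
    graph = defaultdict(lambda: defaultdict(Edge))
    for span in spans:
        if not (window_start <= span["timestamp"] < window_end):
            continue  # outside the aggregation window
        if span["caller"] == span["callee"]:
            continue  # assumed to be an in-process span, not a service edge
        edge = graph[span["caller"]][span["callee"]]
        edge.calls += 1
        edge.errors += int(span["error"])
    return graph
```

Passing a window that covers the last 24-72 hours gives the smoothed graph described above; a much shorter window would drop low-frequency edges.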

Graph traversal during an incident

Given a degraded service X and a live dependency graph, blast radius traversal is a breadth-first search (BFS) that starts at X and follows the caller-callee edges in reverse, from callee to caller, until it has reached every direct and transitive downstream consumer. Each node visited is a potentially impacted service; the edge weight (call volume, dependency criticality) determines the estimated impact severity.
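The traversal itself is short. The sketch below assumes the graph is a plain mapping from caller to callees (a hypothetical shape chosen for brevity). Because the edges point from caller to callee, the search runs over the inverted graph so that it moves from X toward the services that consume it.

```python
from collections import deque


def blast_radius(graph, source):
    """Return {service: hop distance} for every direct or transitive
    caller of `source`. `graph` maps caller -> iterable of callees."""
    # Invert the edges: callee -> set of callers.
    callers = {}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    hops = {}
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        service, depth = queue.popleft()
        for caller in callers.get(service, ()):
            if caller not in visited:  # also guards against call cycles
                visited.add(caller)
                hops[caller] = depth + 1
                queue.append((caller, depth + 1))
    return hops


# Hypothetical topology for illustration.
graph = {
    "web": ["search-api", "checkout"],
    "search-api": ["catalog"],
    "checkout": ["catalog", "payments"],
    "recommendation-engine": ["catalog"],
}
impacted = blast_radius(graph, "catalog")
```

The hop distance is a rough proxy for how soon a service is likely to feel the degradation; services that X merely shares a caller with, such as `payments` here, are not part of the result.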

In practice, the traversal needs to answer: which downstream services are synchronously dependent on X (direct impact, immediate latency increase), which are asynchronously dependent (indirect impact, degraded data freshness or delayed processing), and which have isolation mechanisms in place (circuit breakers, fallback patterns, bulkheads) that may limit propagation.

This is where a live dependency graph differs from a static diagram. If your trace data shows that service B calls service X via a circuit-breaker pattern and has a fallback response configured, the effective blast radius for B is much smaller than for service C that makes synchronous blocking calls to X with no fallback. Same structural relationship, very different live impact.
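One way to encode those distinctions is to attach a small profile to each edge and map it to a coarse impact label. The `EdgeProfile` fields and the four labels below are hypothetical names chosen for this sketch; a real system would likely use a finer-grained score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeProfile:
    """How a consumer calls the degraded service (assumed fields)."""
    synchronous: bool
    circuit_breaker: bool = False
    fallback: bool = False


def estimate_impact(edge: EdgeProfile) -> str:
    """Coarse impact label for one consumer of a degraded service."""
    if not edge.synchronous:
        return "indirect"   # stale data or delayed processing
    if edge.circuit_breaker and edge.fallback:
        return "contained"  # fast-fail plus a substitute response
    if edge.circuit_breaker or edge.fallback:
        return "partial"    # some isolation, but not both
    return "direct"         # blocking call with nothing in between


service_b = EdgeProfile(synchronous=True, circuit_breaker=True, fallback=True)
service_c = EdgeProfile(synchronous=True)
```

With these profiles, B and C share the same structural edge to X but receive different labels, which is the contrast described above.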

A concrete scenario

Consider an e-commerce backend: a product catalog service starts returning high latency responses — P99 at 2.8 seconds versus a 200ms baseline. The catalog service is called by: the search API (synchronously, inline with search results), the recommendation engine (asynchronously, for product enrichment), and the checkout service (synchronously, for inventory validation).

Structural blast radius: three services. Live blast radius depends on how each handles catalog latency. The search API has no circuit breaker; its P99 immediately degrades proportionally. The recommendation engine is async with a 5-second timeout and a cached fallback; it's affected but with delayed impact. The checkout service has a 3-second timeout and a fallback to "proceed without inventory check"; it degrades after the 3-second timeout but doesn't hard-fail.

Without blast radius mapping, three teams get alerted as each service crosses its alert threshold, likely within 5-8 minutes of each other. With blast radius mapping, the moment catalog latency spikes, an engineer knows that search is immediately affected (no isolation), checkout will start falling back as soon as catalog calls exceed its 3-second timeout, and recommendations have a short grace period. The incident can be communicated to all three teams simultaneously, rather than reactively as each alert fires.
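The scenario can be written down as data so that a briefing for all three teams is produced in one pass. This is only a sketch: the field names and the `briefing` helper are hypothetical, and the values are the ones stated in the scenario.

```python
CATALOG_CONSUMERS = [
    # timeout_s / fallback of None means the scenario states no isolation.
    {"service": "search API", "mode": "sync",
     "timeout_s": None, "fallback": None},
    {"service": "recommendation engine", "mode": "async",
     "timeout_s": 5, "fallback": "cached response"},
    {"service": "checkout service", "mode": "sync",
     "timeout_s": 3, "fallback": "proceed without inventory check"},
]


def briefing(source, consumers):
    """One line per consumer, most exposed first."""
    def exposure(consumer):
        # Synchronous callers sort before async ones; within each group,
        # callers without a fallback sort first.
        return (consumer["mode"] != "sync", consumer["fallback"] is not None)

    lines = []
    for c in sorted(consumers, key=exposure):
        if c["fallback"] is None:
            protection = "no isolation, affected immediately"
        else:
            protection = f"{c['timeout_s']}s timeout, falls back to {c['fallback']}"
        lines.append(f"{source} -> {c['service']} ({c['mode']}): {protection}")
    return lines


for line in briefing("catalog", CATALOG_CONSUMERS):
    print(line)
```

The ordering puts the search API first, then checkout, then recommendations, which matches the order in which the teams need to hear about the incident.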

The isolation mechanism inventory

Blast radius mapping is only as useful as your knowledge of isolation mechanisms. A circuit breaker that's supposed to be in service B might be disabled or misconfigured. A fallback cache might be stale. A timeout that was set to 500ms might have been bumped to 10 seconds during a debug session six months ago and never changed back.

This means the isolation mechanism inventory needs to be kept current, and one source of truth for that is — again — trace data. If service B's circuit breaker is correctly configured, you'll see calls to X being rejected (fast-fail) in the trace data when X is degraded. If you see calls to X with 8-second durations instead of fast-fails, the circuit breaker isn't firing. Trace data surfaces misconfigured isolation mechanisms in a way that documentation-based inventories simply can't.
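A simple version of that check can be expressed as a heuristic over call durations on a single edge, sampled while the callee is known to be degraded. The thresholds below (`fast_fail_ms`, `min_fast_fail_share`) are assumptions for illustration, and duration alone cannot distinguish a fast rejection from a fast success, so a real check would also look at the span status.

```python
def breaker_appears_functional(call_durations_ms, fast_fail_ms=50.0,
                               min_fast_fail_share=0.5):
    """Heuristic: while the callee is degraded, a tripped circuit breaker
    should make most calls on this edge return quickly.

    Returns True or False, or None when no calls were observed.
    """
    if not call_durations_ms:
        return None  # no traffic on the edge, nothing to conclude
    fast = sum(1 for d in call_durations_ms if d <= fast_fail_ms)
    return fast / len(call_durations_ms) >= min_fast_fail_share


# Hypothetical samples for the B -> X edge during an X degradation.
slow_calls = [8000.0, 7900.0, 8100.0, 12.0]   # mostly ~8 s: breaker not firing
fast_fails = [5.0, 8.0, 6.0, 8000.0]          # mostly fast-fails: breaker open
```

An edge that returns `False` here is a candidate for the "isolation mechanism appears non-functional" flag, whether the check runs during an incident or as part of a periodic audit.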

We're not saying you should only rely on trace-derived isolation discovery — explicit service contract documentation still has value, especially for new team members. The point is that the trace data is ground truth for what's actually happening in production, and an automated system that can read that data and flag "isolation mechanism appears non-functional for this edge" is valuable both for incident response and for proactive configuration audits.

Blast radius as a deploy risk signal

One underused application of blast radius analysis is pre-deploy risk estimation. If you're about to deploy a new version of a shared service that's in the call path of 12 downstream consumers, that's a higher-risk deploy than one affecting 2 downstream consumers with circuit-breaker isolation. Quantifying that risk before the deploy — and surfacing it to the engineer doing the deploy — is how you make incident risk management proactive rather than reactive.

The output of pre-deploy blast radius analysis isn't a block — it's context. "This service has 8 direct downstream consumers. 3 have circuit breakers configured. 2 make synchronous blocking calls with no fallback. Proceed with awareness." That's the kind of context that changes behavior at deploy time without becoming a bureaucratic gate.
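Generating that message from the dependency graph takes little code. The sketch below assumes each direct consumer is described by a dict with the hypothetical keys `synchronous`, `circuit_breaker`, and `fallback`; how those flags are populated (trace data, configuration, or both) is left open.

```python
def deploy_context(service, consumers):
    """Build an advisory message for the engineer deploying `service`.

    `consumers` lists the direct downstream consumers of `service`.
    The result is context only; nothing here blocks the deploy.
    """
    total = len(consumers)
    with_breaker = sum(1 for c in consumers if c["circuit_breaker"])
    exposed = sum(
        1 for c in consumers if c["synchronous"] and not c["fallback"]
    )
    return (
        f"{service} has {total} direct downstream consumers. "
        f"{with_breaker} have circuit breakers configured. "
        f"{exposed} make synchronous blocking calls with no fallback. "
        "Proceed with awareness."
    )


# Hypothetical consumer set matching the example message above.
consumers = (
    [{"synchronous": True, "circuit_breaker": True, "fallback": True}] * 3
    + [{"synchronous": True, "circuit_breaker": False, "fallback": False}] * 2
    + [{"synchronous": False, "circuit_breaker": False, "fallback": True}] * 3
)
message = deploy_context("catalog", consumers)
```

Printing `message` in the deploy tool's output, rather than failing the pipeline on it, keeps the signal advisory.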

The same dependency graph that powers incident blast radius traversal powers pre-deploy risk scoring. One model, two use cases: reactive during incidents, proactive during deploys. That's the leverage in treating the dependency graph as a first-class operational artifact rather than an architecture diagram you update once a quarter.