Remediation runbooks that actually work

Jordan Kessel

There are two kinds of runbooks that don't get used during incidents. The first is too abstract: it says things like "investigate the database layer" without specifying which database, which metrics to look at, or what constitutes abnormal. The second is too specific: it was written for a particular database schema, a particular replica topology, a particular alert threshold — and the system has since drifted enough that half the steps no longer apply.

Most runbook libraries contain both types, often in the same document. The result is that on-call engineers learn to treat runbooks as background reading rather than operational guidance. They skim them for context clues and then improvise, which means the 45 minutes of runbook preparation paid for maybe 3 minutes of triage speed.

We've thought a lot about this at Devloom because our remediation diff feature outputs structured fix suggestions — and those suggestions are only as good as the underlying reasoning about what to actually do. The same structure problems that make human-written runbooks useless also make generated remediation suggestions useless. Here's the structure we've found works.

The three-layer runbook structure

A functional runbook operates at three distinct layers, and conflating them is what causes both failure modes described above.

Layer 1: Symptom verification. This is what the alert tells you and how you confirm you're actually seeing it. It answers "what does this incident look like?" before any remediation starts. A good symptom verification section includes the specific metric or log pattern that triggered the alert, the threshold it crossed, and at least one alternative query to confirm you're not dealing with a false positive from the monitoring pipeline. Teams skip this layer because it feels obvious, until they're paged at 2am and can't tell whether the alert is real or the metrics pipeline is backed up.

Layer 2: Scope assessment. Before you touch anything, you need to know how far the problem has spread and what depends on the affected service. This is blast radius assessment formalized into a runbook step. Questions in this layer: Is this one instance or all instances? Is the error rate rising or stable? Which downstream services are showing degradation? This layer should take two to five minutes and should leave you with a clear picture of the impact boundary.

Layer 3: Remediation decision tree. This is where most runbooks start, skipping layers 1 and 2. The remediation layer should be a conditional structure, not a linear list. "If you see X, do Y. If you see X but also Z, do W instead." The conditions should be testable from your observability stack, not judgment calls that require knowing the service intimately.
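
Here's a minimal sketch of the three layers in Python. Everything named in it is an assumption for illustration: the Observations fields, the branch conditions, and the action strings are placeholders, not a recommended policy. The point is only the shape: symptom verification gates everything else, scope data feeds the conditions, and the remediation layer is an ordered set of testable branches rather than a linear list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Observations:
    """Hypothetical snapshot of what the observability stack reports."""
    primary_alert_firing: bool     # Layer 1: the signal that paged you
    confirming_query_agrees: bool  # Layer 1: an independent second query
    affected_instances: int        # Layer 2: how many instances show errors
    total_instances: int           # Layer 2: size of the fleet
    error_rate_rising: bool        # Layer 2: rising or stable
    recent_deploy: bool            # illustrative Layer 3 condition


@dataclass(frozen=True)
class Branch:
    condition: Callable[[Observations], bool]
    action: str


# Layer 3 as ordered branches. More specific conditions come first,
# mirroring "if you see X but also Z, do W instead".
BRANCHES: List[Branch] = [
    Branch(lambda o: o.recent_deploy and o.error_rate_rising,
           "roll back the most recent deploy"),
    Branch(lambda o: o.affected_instances == 1,
           "restart the single affected instance"),
    Branch(lambda o: o.affected_instances == o.total_instances,
           "escalate to the owning team"),
]


def next_action(obs: Observations) -> str:
    # Layer 1: confirm the symptom before touching anything.
    if not obs.primary_alert_firing:
        return "no action: alert is not firing"
    if not obs.confirming_query_agrees:
        return "check the monitoring pipeline: possible false positive"
    # Layer 3: the first matching branch wins.
    for branch in BRANCHES:
        if branch.condition(obs):
            return branch.action
    return "escalate: no branch matched"
```

Every condition here reads a field that a dashboard or query could fill in, which is the property that matters. A branch that can only be evaluated by someone who knows the service intimately doesn't belong in the tree.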

What makes a remediation step actionable

A remediation step is actionable if an engineer who has never seen this service can execute it safely under pressure. That's the bar. It sounds strict but it's the right one, because the on-call engineer who picks up your alert at 3am may be joining the rotation from another team this week.

Actionable steps have four components:

  1. The action itself, in copy-pasteable form. Not "restart the payment service pods" but "kubectl rollout restart deployment/payment-api -n production". Exact commands, exact namespaces, exact names. The argument that "this will go stale" is real but not a sufficient reason to omit it. Stale commands are fixable in the post-incident review; absent commands mean the engineer guesses under pressure.
  2. The expected outcome. What should you see in the next two minutes if the step worked? "Error rate drops below 0.5% in Grafana panel X" is verifiable. "Things should improve" is not.
  3. The rollback for this step. Every remediation step that changes system state should have an adjacent rollback. This is not optional. If you're instructing someone to drain a node, tell them how to un-drain it. If you're pushing a config change, tell them how to revert it.
  4. An escalation path if the step doesn't work. If the expected outcome hasn't materialized within five minutes of running the step, what's the next branch? Who do you call? Which team owns the dependency that might be causing this?
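
One way to keep these four components from going missing is to treat a step as a record and lint it. The sketch below is a minimal Python illustration, not a real tool: RemediationStep, missing_components, the node name node-1, and the escalation text are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RemediationStep:
    command: str               # 1. the action, in copy-pasteable form
    expected_outcome: str      # 2. what you should see if it worked
    rollback: str              # 3. how to undo this step
    escalation: str            # 4. the next branch if it didn't work
    changes_state: bool = True # read-only steps don't need a rollback


def missing_components(step: RemediationStep) -> List[str]:
    """Return the names of the components this step fails to provide."""
    missing = []
    if not step.command.strip():
        missing.append("command")
    if not step.expected_outcome.strip():
        missing.append("expected_outcome")
    # A rollback is only required when the step changes system state.
    if step.changes_state and not step.rollback.strip():
        missing.append("rollback")
    if not step.escalation.strip():
        missing.append("escalation")
    return missing


# Hypothetical example step: draining a node, with its un-drain alongside.
drain = RemediationStep(
    command="kubectl drain node-1 --ignore-daemonsets",
    expected_outcome="Error rate drops below 0.5% in Grafana panel X",
    rollback="kubectl uncordon node-1",
    escalation="Page the team that owns the dependency",
)
```

Calling missing_components(drain) returns an empty list. Blank out the rollback and it returns ["rollback"], which is the kind of check that could run when a runbook is edited rather than when it is needed.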

The freshness problem and how to solve it

Runbooks go stale because systems change and nobody updates the documentation. The fix is not "better discipline about updating runbooks" — that's never worked anywhere. The fix is architectural: runbooks should reference stable abstractions rather than specific implementation details wherever possible.

Stable: the name of the service, the name of the team, the category of action (restart, rollback, scale). These rarely change.

Unstable: specific deployment names, specific port numbers, specific database hostnames, specific threshold values. These change frequently.

For unstable details, the runbook should either link to a living source of truth (a service registry, a config management system, a dashboard) or include a "verify before running" note. "Run kubectl get deployments -n production | grep payment to confirm the deployment name, then substitute below" is ugly but honest.

We're not saying you should avoid specifics entirely. A runbook with no specific commands is useless. The goal is to isolate the specific commands that are likely to drift and flag them explicitly, rather than presenting the whole document as if it were always current.
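
As a rough sketch of that isolation, the command below keeps the stable part (the category of action) as a template and pulls the unstable parts from a lookup. This is Python for illustration only. The REGISTRY dict is a stand-in for whatever living source of truth you actually have, and its keys and values are assumptions.

```python
from string import Template

# Stand-in for a living source of truth such as a service registry.
# The entries are hypothetical.
REGISTRY = {
    "payment": {"deployment": "payment-api", "namespace": "production"},
}

# Stable: the category of action (restart).
# Unstable: the deployment name and namespace, flagged as $placeholders.
RESTART = Template("kubectl rollout restart deployment/$deployment -n $namespace")


def render_restart(service: str, registry: dict) -> str:
    """Fill the unstable details in from the registry, or refuse loudly."""
    entry = registry.get(service)
    if entry is None:
        raise LookupError(
            f"{service!r} is not in the registry; verify names by hand before running"
        )
    return RESTART.substitute(entry)
```

With the registry above, render_restart("payment", REGISTRY) produces the same command a hand-written runbook would contain. An unknown service raises instead of guessing, which is the code equivalent of a "verify before running" note.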

Runbook ownership and the post-incident update gate

Runbooks without owners get stale. The ownership model that works is: the team that owns the service owns its runbooks. Not a central ops team, not a documentation team. The people who will be paged when this service breaks.

The enforcement mechanism is the post-incident review. Every incident that involved a runbook — whether it helped or not — should end with a runbook update as a required action item. Not "update if needed," not "consider updating." Required. The update review takes 15 minutes and prevents the next incident from being longer than it needs to be.

If a runbook was consulted and the engineer went off-script anyway, that's signal. Document what they actually did. That's the runbook now.
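
If your post-incident reviews live in a tracker, the gate can be expressed as a check that blocks closing the review. The sketch below is a minimal Python illustration under that assumption. IncidentReview and its fields are hypothetical, not a real tracker schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReview:
    runbook_consulted: bool
    runbook_updated: bool
    # What the engineer did that the runbook didn't say to do.
    off_script_actions: List[str] = field(default_factory=list)
    # Which of those actions have been written back into the runbook.
    documented_actions: List[str] = field(default_factory=list)


def blocking_items(review: IncidentReview) -> List[str]:
    """Return the items that must be resolved before the review can close."""
    items = []
    if review.runbook_consulted and not review.runbook_updated:
        items.append("runbook update is required")
    for action in review.off_script_actions:
        if action not in review.documented_actions:
            items.append(f"document off-script action: {action}")
    return items
```

An empty list means the review can close. Anything else is a required action item, not a suggestion.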

Where Devloom's remediation diffs fit into this

Our remediation diff output is designed to slot into Layer 3 of the structure described above — after symptom verification and scope assessment have already established context. When Devloom identifies a root cause (a bad deploy, a config change, a dependency timeout), the remediation diff gives you a specific proposed action with a change preview: here's what the rollback would look like, here's what the config revert changes.

That's not a replacement for a runbook. It's a replacement for the "what command do I run" step within a runbook — the part that's most likely to be stale or missing. The judgment about whether to apply the remediation, whether the blast radius is acceptable, and who needs to be notified: that still lives in the runbook, and the runbook still needs the three-layer structure to be useful.

The goal of combining structured runbooks with automated remediation suggestions is to compress the time between "alert fired" and "engineer takes confident action." The runbook handles orientation; the diff handles the specific fix. The engineer handles the judgment call about whether to apply it.

Avoiding the checklist trap

The last failure mode worth naming: runbooks that become compliance artifacts rather than operational tools. This happens when runbooks are written to satisfy an audit requirement or a management request rather than to help the person on call. They check boxes but don't answer questions. They're thorough in coverage but thin on specifics.

The test for whether a runbook has fallen into this trap: give it to a new engineer and ask them to simulate executing it on a staging system. If they get stuck more than twice in the first five steps, the runbook is an artifact, not a tool. Fix it before the next incident.