Automated Kubernetes rollbacks have a seductive premise: your deployment breaks something, your monitoring detects the regression, the system rolls back the deployment without a human in the loop. Zero-touch remediation. The on-call engineer's phone stays quiet.
The problem is that "your deployment broke something" is often not the actual causal structure. A deploy coincided with a traffic spike. A deploy happened alongside a database migration that's still running. A deploy touched service A, but the regression is in service B which has a data dependency on A that only matters under certain query patterns. Automated rollback of the deploy in each of these cases ranges from unhelpful to actively harmful.
This post is about the decision framework we use when building Devloom's automated rollback feature — specifically, what signals justify automated action versus signals that require a human in the loop.
What Kubernetes rollback actually does
Before the decision framework, a quick grounding on what kubectl rollout undo actually does, because there are common misconceptions that affect how teams reason about when it's safe.
A Kubernetes Deployment maintains a revision history. When you roll back, you restore the pod template from the previous revision, which is the previous ReplicaSet's configuration: the previous image tag, the previous pod spec, the previous resource limits. You are not reverting anything outside the pod template. You're not reverting ConfigMaps. You're not reverting Secrets. You're not reverting anything in a connected database. You're not reverting the Kubernetes Service configuration or Ingress rules.
If your deploy included a pod spec change plus a ConfigMap change, rolling back the deployment leaves the new ConfigMap in place while running the old pods. Whether that combination is safe depends on whether the old code is backward-compatible with the new config — something the rollback mechanism has no knowledge of.
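To make that scope concrete, here is a minimal Python sketch that splits a deploy's change set into what a rollback restores and what it leaves at the new version. The kind names (`pod_spec`, `configmap`, and so on) and the `split_change_set` helper are hypothetical labels for illustration, not Kubernetes or Devloom identifiers.

```python
# Only the Deployment's pod template is restored by a rollback.
# The label "pod_spec" is an illustrative name chosen for this sketch.
REVERTED_BY_ROLLBACK = {"pod_spec"}


def split_change_set(changed_kinds):
    """Split changed kinds into (reverted, left_in_place), each sorted."""
    kinds = set(changed_kinds)
    reverted = sorted(kinds & REVERTED_BY_ROLLBACK)
    left_in_place = sorted(kinds - REVERTED_BY_ROLLBACK)
    return reverted, left_in_place


reverted, left = split_change_set(["pod_spec", "configmap"])
print("restored by rollback:", reverted)
print("still at new version:", left)
```

Anything in the second list is state the old pods will have to tolerate after the rollback.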
This is why "just roll it back" is sometimes exactly the wrong move. The state space after rollback may be worse than the degraded state before rollback.
The four conditions for safe automated rollback
Through working on this problem, we've identified four conditions that, when all are present, make automated rollback a net positive rather than a risk:
1. The regression started within a bounded window of the deploy. We use a 15-minute window. If the error rate spike or latency regression began more than 15 minutes after the deploy completed, the causal link is weak. A database query that was slow before the deploy, a dependency that was already degraded, a traffic pattern that was already building — these don't become deploy-caused problems by proximity. Automated rollback triggered by old problems is pure noise.
2. The regression is in the same service that was deployed. Cross-service regressions require causal reasoning that automated systems handle poorly. If service A was deployed and service B is now returning errors, the path from deploy to regression may run through service A's changed behavior, or it may run through coincidence. Automated rollback of service A based on service B's errors is usually wrong.
3. The deploy change set is pod-spec-only. No ConfigMap changes, no Secret changes, no adjacent schema migrations that touch the same data the new pod spec reads. Pod-spec-only changes are genuinely reversible by rollback. Mixed change sets are not.
4. The regression signal is clean: error rate, not latency alone. Error rate increases (5xx responses above a threshold) are a strong signal that the change broke something. Latency increases without error rate increases are more ambiguous — they may indicate a rollout that's still in progress, a cache warming period, or an unrelated dependency issue. We automate on error rate; we alert and require human confirmation on latency-only regressions.
When all four conditions are present, we're confident enough to trigger automated rollback. When any one is absent, we generate an alert with the rollback command pre-populated and require a human decision.
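The four conditions can be sketched as a single gate. This is a minimal illustration, not Devloom's implementation: the `Incident` fields, the condition labels, and the choice to count the window from rollout start through 15 minutes after completion are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta

REGRESSION_WINDOW = timedelta(minutes=15)  # window stated in the text


@dataclass(frozen=True)
class Incident:
    """Hypothetical snapshot of a deploy-correlated regression."""
    deploy_started_at: datetime
    deploy_completed_at: datetime
    regression_started_at: datetime
    deployed_service: str
    regressing_service: str
    changed_kinds: frozenset   # e.g. frozenset({"pod_spec", "configmap"})
    error_rate_breached: bool  # 5xx rate above the alert threshold


def failed_conditions(incident):
    """Return labels of unmet conditions; an empty list means all four hold."""
    failed = []
    # 1. Regression began during the rollout or within 15 minutes of completion
    #    (counting from rollout start is an assumption of this sketch).
    latest = incident.deploy_completed_at + REGRESSION_WINDOW
    if not (incident.deploy_started_at <= incident.regression_started_at <= latest):
        failed.append("timing")
    # 2. Regression is in the service that was deployed.
    if incident.regressing_service != incident.deployed_service:
        failed.append("cross_service")
    # 3. Change set is pod-spec-only.
    if incident.changed_kinds != frozenset({"pod_spec"}):
        failed.append("mixed_change_set")
    # 4. Signal is an error-rate breach, not latency alone.
    if not incident.error_rate_breached:
        failed.append("no_error_rate_signal")
    return failed


def decide(incident):
    """Automate only when every condition holds; otherwise hand off to a human."""
    return "auto_rollback" if not failed_conditions(incident) else "alert_human"


t0 = datetime(2024, 1, 1, 12, 0)
clean = Incident(t0, t0 + timedelta(minutes=3), t0 + timedelta(minutes=5),
                 "checkout", "checkout", frozenset({"pod_spec"}), True)
latency_only = replace(clean, error_rate_breached=False)
print(decide(clean))
print(decide(latency_only), failed_conditions(latency_only))
```

Returning the list of failed conditions, rather than a bare boolean, gives the alert something to show the engineer about why automation stood down.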
The schema migration problem deserves its own section
Deploy-plus-schema-migration is the scenario where automated rollback is most dangerous and where most automated rollback horror stories originate. The sequence typically looks like this: a migration runs and adds a column, new pods deploy and start writing to the new column, the error rate spikes for an unrelated reason, automated rollback deploys the old pods, the old pods can't handle the new schema, and the result is a full service outage.
The rollback made a partial outage into a full one because the automated system had no awareness that a database migration was running concurrently with the deploy.
Our approach: Devloom's rollback decision checks for any schema migration events (tracked through the deployment event stream we ingest) within 30 minutes before or after the deploy. If a migration is present, automated rollback is disabled for that deploy. Full stop. The operator has to decide. This is a case where the cost of a wrong automated action is much higher than the cost of a brief human-decision delay.
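A minimal sketch of that window check, assuming migration events arrive as plain timestamps. The function name and the input shape are hypothetical; the real event stream format is not described here.

```python
from datetime import datetime, timedelta

MIGRATION_WINDOW = timedelta(minutes=30)  # window stated in the text


def migration_blocks_auto_rollback(deploy_at, migration_times):
    """True if any migration event falls within 30 minutes either side of the deploy."""
    return any(abs(t - deploy_at) <= MIGRATION_WINDOW for t in migration_times)


deploy_at = datetime(2024, 1, 1, 12, 0)
events = [deploy_at - timedelta(minutes=20)]  # migration 20 minutes before the deploy
print(migration_blocks_auto_rollback(deploy_at, events))
```

The check is deliberately one-sided: a migration in the window can only disable automation, never enable it.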
Rolling update state and partial rollbacks
Kubernetes rolling deployments add another complication: the old and new versions run simultaneously during the rollout window. If you trigger a rollback partway through a rolling update, you may be rolling back from a half-deployed state rather than a fully-deployed state. The revert goes back to the previous ReplicaSet, but which pods were on which version at the moment you issued the rollback determines what state you land in.
In practice, this is usually fine — the previous ReplicaSet configuration is stable, and Kubernetes will converge on it. But it means the rollback completion isn't instantaneous. During the rollback rollout, you have multiple revisions running simultaneously again, which can make your error rate metrics look confusing.
For automated rollback systems, the implication is: do not re-evaluate the rollback trigger condition during the rollback rollout itself. Seeing a noisy error rate while two pod revisions are in flight and triggering a second rollback makes recovery harder: an undo with no explicit target revision goes to whatever revision is now "previous", which can be the bad revision you just rolled away from, and repeated rollbacks churn through the retained revision history. Set a 10-minute post-rollback blackout period before re-evaluating.
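The blackout period can be sketched as a small gate that the trigger evaluation consults first. The `RollbackGate` class and its method names are hypothetical; only the 10-minute figure comes from the text.

```python
from datetime import datetime, timedelta

BLACKOUT = timedelta(minutes=10)  # post-rollback blackout stated in the text


class RollbackGate:
    """Suppresses trigger evaluation while a rollback rollout is still settling."""

    def __init__(self, blackout=BLACKOUT):
        self.blackout = blackout
        self.last_rollback_at = None

    def record_rollback(self, at):
        """Remember when the most recent rollback was issued."""
        self.last_rollback_at = at

    def may_evaluate(self, now):
        """True if no rollback has happened yet or the blackout has elapsed."""
        if self.last_rollback_at is None:
            return True
        return now - self.last_rollback_at >= self.blackout


gate = RollbackGate()
issued = datetime(2024, 1, 1, 12, 0)
gate.record_rollback(issued)
print(gate.may_evaluate(issued + timedelta(minutes=4)))   # still inside the blackout
print(gate.may_evaluate(issued + timedelta(minutes=10)))  # blackout has elapsed
```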
What automated rollback should hand off to the engineer
When we do trigger an automated rollback, we treat it as the beginning of the incident investigation, not the end. The automated rollback is the immediate mitigation — it reduces customer impact while the root cause is being investigated. It is not a resolution.
The handoff we give to the on-call engineer after an automated rollback:
- Which deploy was rolled back and what was in the change set (commit diff, PR link)
- The signal that triggered the rollback (error rate graph, threshold, timing)
- Current error rate post-rollback (is it recovering?)
- Whether the root cause is likely in the rolled-back code or in something else (Devloom's causal inference output for the incident window)
The last item matters because automated rollback might have resolved the symptoms without addressing the cause. If the error rate recovers after rollback, the engineer needs to understand whether the bad code has genuinely been removed or whether the system happened to recover for an unrelated reason at the same time.
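As an illustration, the handoff can be modeled as one record that carries all four items. The field names, the example values, and the rule that "recovering" means the error rate is below its level at trigger time are assumptions of this sketch, not Devloom's schema.

```python
from dataclasses import dataclass


@dataclass
class RollbackHandoff:
    """Hypothetical payload handed to the on-call engineer after a rollback."""
    rolled_back_deploy: str       # identifier of the deploy that was reverted
    change_set_links: list        # commit diff and PR links
    trigger_signal: str           # what fired: metric, threshold, timing
    error_rate_at_trigger: float  # 5xx fraction when the rollback fired
    error_rate_now: float         # same metric after the rollback
    causal_assessment: str        # summary of the causal analysis for the window

    @property
    def recovering(self):
        """Assumed rule: recovering if the error rate is below its trigger level."""
        return self.error_rate_now < self.error_rate_at_trigger

    def summary(self):
        """Render the handoff as plain text for an alert or chat message."""
        lines = [
            f"Rolled back: {self.rolled_back_deploy}",
            f"Change set: {', '.join(self.change_set_links)}",
            f"Trigger: {self.trigger_signal}",
            f"Error rate: {self.error_rate_at_trigger:.1%} -> {self.error_rate_now:.1%}"
            f" (recovering: {self.recovering})",
            f"Likely cause: {self.causal_assessment}",
        ]
        return "\n".join(lines)


# All values below are made-up placeholders.
handoff = RollbackHandoff(
    rolled_back_deploy="deploy-123",
    change_set_links=["<commit diff link>", "<PR link>"],
    trigger_signal="5xx rate above threshold, 4 minutes after deploy",
    error_rate_at_trigger=0.08,
    error_rate_now=0.01,
    causal_assessment="undetermined",
)
print(handoff.summary())
```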
The case for keeping humans in the loop on most rollbacks
We're not saying automated rollback is wrong — it's genuinely valuable for the narrow set of cases where all four conditions are met. But the fraction of production incidents where all four conditions are cleanly met is lower than people expect. In our experience, it's roughly 30-40% of deploy-correlated incidents. The majority have some complicating factor: a concurrent migration, a cross-service regression, an ambiguous latency signal, or a deploy that touched config alongside the pod spec.
For that 60-70%, the better answer is a fast human decision with good context, not automated action on a signal that doesn't cleanly imply what the automated action assumes. The goal of Devloom's rollback suggestion feature is to make the human decision fast: present the rollback command, the blast radius assessment, the confidence score, and the "apply" button within 90 seconds of the alert firing. A fast, correct human decision beats a faster automated action built on a wrong assumption.