I've been on-call in some form since 2017. The texture of it has changed significantly, and not all of it in the direction you'd expect from a decade of tooling advancement. The phones are louder. The systems are more complex. The recovery expectations, both from customers and internally, have compressed. Being paged at 2am used to mean you had a few minutes to orient yourself before taking action. Now it often means you have those same few minutes to orient yourself, consult three dashboards, read three services' logs, and trace a dependency graph, and you're expected to have a confident initial theory before your on-call lead joins the incident bridge.
Building Devloom has forced us to think very precisely about what makes that experience worse and what actually helps. This is an honest accounting of what's changed, what's improved, and where the remaining pain lives.
What's genuinely better: deployment-to-incident correlation
This is the area of clearest progress in the last few years. Deployment event streaming — getting your CI/CD pipeline to emit structured deployment events into your observability stack — has become standard practice. The result: when an alert fires, the first question ("did something just deploy?") now takes 15 seconds instead of three minutes of checking Slack deploy bots and GitHub deployment logs.
The pattern is simple, but it takes tooling investment to get right. Datadog Events, Grafana Annotations, or OTEL deployment markers all serve the purpose. You can overlay deploy markers on your metrics graphs and see instantly whether the error spike started before or after the deploy, and whether the deploy touched the service that's alerting or an upstream. This alone removes the most common false start in incident response: confidently rolling back a deploy that wasn't the cause.
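To make the before-or-after question concrete, here is a minimal sketch of the check a deploy overlay lets you do by eye. The `DeployEvent` shape, the function name, and the 30-minute window are all assumptions for illustration. They are not the schema of Datadog, Grafana, or any other tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class DeployEvent:
    # Hypothetical event shape; real pipelines will carry more fields.
    service: str
    version: str
    finished_at: datetime


def deploy_preceding_spike(
    deploys: list[DeployEvent],
    spike_started_at: datetime,
    candidate_services: set[str],
    window: timedelta = timedelta(minutes=30),  # assumed lookback, tune per team
) -> Optional[DeployEvent]:
    """Return the latest deploy to a candidate service that finished before
    the spike began and within `window` of it, or None if there is none."""
    matches = [
        d for d in deploys
        if d.service in candidate_services
        and timedelta(0) <= spike_started_at - d.finished_at <= window
    ]
    return max(matches, key=lambda d: d.finished_at, default=None)


if __name__ == "__main__":
    deploys = [
        DeployEvent("checkout", "v41", datetime(2026, 1, 12, 2, 2)),
        DeployEvent("search", "v17", datetime(2026, 1, 12, 2, 4)),
    ]
    spike = datetime(2026, 1, 12, 2, 5)
    # Only the alerting service and its upstreams are candidates.
    print(deploy_preceding_spike(deploys, spike, {"checkout", "payments"}))
```

A deploy that finished after the spike started, or that touched a service outside the candidate set, returns nothing. That is the case where rolling back would be the false start.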
The teams that haven't done this yet are usually stuck on "it's not that hard, but we haven't prioritized it." For any team running more than 20 services with multiple deploys per day, it's worth prioritizing. The time savings per incident compound quickly.
What's better, but only partly: alert quality
Alert hygiene has improved in teams that have invested in it, but it hasn't improved uniformly. The structural problem is that alert cleanup is maintenance work, and maintenance work loses to feature work in most team backlogs. The result is that many teams have newer, cleaner alerting on newer services and a long tail of legacy alerting on older services that was set up years ago and fires semi-randomly.
The improvement that's actually helped: multi-condition alerts. Instead of alerting on "error rate above 5%" as a standalone condition, alerting on "error rate above 5% AND request volume above baseline AND latency elevated." Single-condition alerts have too many causes; multi-condition alerts target incident patterns specifically. This cuts alert noise significantly without requiring alert redesign from scratch — you're adding conditions to existing alerts rather than rebuilding them.
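As a rough illustration of the three-condition example above, the sketch below combines the checks into one paging decision. The 5% threshold comes from the example. The `ServiceSnapshot` fields and the 1.5x latency factor are assumptions, and in practice this logic would live in your alerting tool's rule language rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSnapshot:
    # Hypothetical metric snapshot for one evaluation interval.
    error_rate: float                 # fraction of requests failing, 0.0 to 1.0
    requests_per_sec: float
    baseline_requests_per_sec: float
    p99_latency_ms: float
    baseline_p99_latency_ms: float


def should_page(
    s: ServiceSnapshot,
    error_rate_threshold: float = 0.05,  # "error rate above 5%"
    latency_factor: float = 1.5,         # assumed meaning of "latency elevated"
) -> bool:
    """Page only when all three conditions hold at the same time."""
    errors_high = s.error_rate > error_rate_threshold
    volume_above_baseline = s.requests_per_sec > s.baseline_requests_per_sec
    latency_elevated = s.p99_latency_ms > latency_factor * s.baseline_p99_latency_ms
    return errors_high and volume_above_baseline and latency_elevated


if __name__ == "__main__":
    # High error rate on a near-idle service: the single-condition alert
    # would fire, the multi-condition one does not.
    quiet = ServiceSnapshot(0.08, 20, 1000, 900, 400)
    print(should_page(quiet))
```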
What hasn't improved: the alert backlog problem. Plenty of teams still carry 200+ alert rules that nobody fully understands. The solution here isn't better alert management tooling. It's an intentional audit where every alert gets a named owner and an expiration date. Alerts without owners get disabled. This is painful, but it's the only thing that actually works.
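The audit rule is mechanical enough to sketch. The `AlertRule` shape is hypothetical. Sending expired or undated alerts to a review pile is my assumption about what an expiration date should trigger, since the only hard rule stated above is that ownerless alerts get disabled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    # Hypothetical inventory record, e.g. exported from your alerting tool.
    name: str
    owner: Optional[str]
    expires_on: Optional[date]


def audit(rules: list[AlertRule], today: date) -> dict[str, list[str]]:
    """Sort alert names into keep / disable / review buckets."""
    result: dict[str, list[str]] = {"keep": [], "disable": [], "review": []}
    for rule in rules:
        if not rule.owner:
            # No named owner: disable.
            result["disable"].append(rule.name)
        elif rule.expires_on is None or rule.expires_on <= today:
            # Assumption: a missing or passed expiration forces a re-review.
            result["review"].append(rule.name)
        else:
            result["keep"].append(rule.name)
    return result


if __name__ == "__main__":
    rules = [
        AlertRule("disk-full", "storage-team", date(2027, 1, 1)),
        AlertRule("legacy-cpu", None, None),
        AlertRule("queue-depth", "platform", date(2025, 6, 1)),
    ]
    print(audit(rules, today=date(2026, 1, 12)))
```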
What's new: AI-assisted context assembly
This is where things have changed most in the last 18 months. The core problem of incident response has always been context assembly: getting from "this alert fired" to "I understand what's happening and why" takes time, and that time is mostly manual — opening dashboards, checking logs, reading service dependencies, finding the last deploy. The assembly process itself is not cognitively hard. It's just slow.
AI tooling has started meaningfully compressing context assembly time, specifically for the retrieval and correlation steps. When Devloom receives an alert, it starts assembling context before the engineer has opened their laptop: here are the services affected, here's the dependency graph for those services, here's the last deploy that touched any of them, here's the error pattern in the logs, here's a confidence-scored theory about the root cause. When the engineer opens the incident, they're not starting from zero — they're reviewing a structured brief and deciding whether the theory holds.
I want to be careful here about what this does and doesn't do. It doesn't replace judgment. The confidence score on a root cause theory can be wrong. The blast radius can be incomplete if service dependencies aren't fully mapped. The engineer still has to evaluate whether the presented theory is correct and whether the suggested remediation is safe to apply. What it removes is the 15 minutes of mechanical retrieval — the "let me open Grafana, now let me check Loki, now let me find the deploy in GitHub Actions" loop that happened before you could form a theory at all.
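To show what "reviewing a structured brief" can look like, here is an illustrative sketch. It is not Devloom's actual data model: the `IncidentBrief` fields, the `review_warnings` helper, and the 0.7 confidence cutoff are all invented for this example. The point is that the two failure modes named above, an incomplete blast radius and a wrong theory, can at least be surfaced to the reviewer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentBrief:
    # Hypothetical shape of an assembled brief.
    alert_name: str
    affected_services: list[str]
    dependencies: dict[str, list[str]]   # service -> services it calls
    last_deploy: Optional[str]
    error_pattern: Optional[str]
    root_cause_theory: Optional[str]
    theory_confidence: float             # 0.0 to 1.0


def review_warnings(brief: IncidentBrief, min_confidence: float = 0.7) -> list[str]:
    """List the reasons an engineer should treat the brief with extra care."""
    warnings = []
    unmapped = [s for s in brief.affected_services if s not in brief.dependencies]
    if unmapped:
        warnings.append(
            "blast radius may be incomplete: no dependency data for "
            + ", ".join(unmapped)
        )
    if brief.root_cause_theory is None:
        warnings.append("no root cause theory was produced")
    elif brief.theory_confidence < min_confidence:
        warnings.append(
            f"low-confidence theory ({brief.theory_confidence:.2f}); verify before acting"
        )
    return warnings


if __name__ == "__main__":
    brief = IncidentBrief(
        alert_name="checkout-5xx",
        affected_services=["checkout", "payments"],
        dependencies={"checkout": ["payments"]},
        last_deploy="checkout v41",
        error_pattern="timeouts calling payments",
        root_cause_theory="payments connection pool exhausted",
        theory_confidence=0.55,
    )
    for warning in review_warnings(brief):
        print(warning)
```

An empty warning list does not mean the theory is right. It only means the brief has no obvious gaps, and the judgment call still belongs to the engineer.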
What hasn't changed: the cognitive load of complex incidents
Multi-service cascading failures are still hard. The class of incident where service A degraded because service B had a memory leak that caused service C's connection pool to exhaust, which caused service A to time out on internal calls — the "follow the dependency chain through three hops" class — remains cognitively expensive regardless of tooling.
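The mechanical part of "follow the dependency chain" can be sketched as a graph walk: start at the alerting service, keep following unhealthy dependencies, and report the unhealthy services that have no unhealthy dependencies of their own. The function name and the input shapes are assumptions. The sketch only narrows where to look. It says nothing about why the service at the end of the chain is unhealthy, which is the expensive part.

```python
def root_cause_candidates(
    depends_on: dict[str, list[str]],  # service -> services it calls
    unhealthy: set[str],               # services currently showing symptoms
    alerting: str,                     # service whose alert fired
) -> list[str]:
    """Walk unhealthy dependencies from `alerting`; return the unhealthy
    services at the end of each chain, sorted by name."""
    candidates = []
    seen = set()
    stack = [alerting]
    while stack:
        service = stack.pop()
        if service in seen:
            continue  # guards against dependency cycles
        seen.add(service)
        bad_deps = [d for d in depends_on.get(service, []) if d in unhealthy]
        if bad_deps:
            stack.extend(bad_deps)       # the problem may sit further downstream
        elif service in unhealthy:
            candidates.append(service)   # unhealthy, with healthy dependencies
    return sorted(candidates)


if __name__ == "__main__":
    # The three-hop example: A calls C, C calls B, and B has the memory leak.
    depends_on = {"A": ["C", "D"], "C": ["B"], "B": [], "D": []}
    print(root_cause_candidates(depends_on, {"A", "B", "C"}, "A"))
```

If every service in a cycle is unhealthy, the walk returns no candidates, and an unmapped dependency makes the walk stop early. Both are reminders of how much this depends on the quality of the dependency map.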
Better tooling resolves the easy incidents faster, which is genuinely valuable. But it also means on-call engineers spend a larger share of their incident-response time on the hard ones, since the easy ones resolve before they demand much cognitive investment. The net effect on on-call experience is ambiguous: fewer long incidents with straightforward root causes, and the same difficulty on the complex ones. Whether that's better depends on which you found more exhausting.
The rotation structure question
One thing that tooling doesn't solve: rotation design. Teams that have moved to follow-the-sun rotations, where on-call responsibility follows daylight hours across time zones rather than being carried by a single engineer for a full week, report significantly better incident handling. Rested engineers make better decisions, investigate more thoroughly, and are less likely to default to "restart it and see" as a first action.
The barrier is usually organizational: it requires at least two geographically distributed team members who can own the rotation, and it requires explicit rotation design rather than the ad-hoc "whoever is around" approach that happens by default. For teams under 10 people, this isn't always feasible. But for teams running critical infrastructure with 15+ engineers, the absence of timezone-distributed rotation is a choice worth revisiting.
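Explicit rotation design can start as small as writing the shifts down and checking them for gaps. The sketch below assumes a hypothetical `Shift` record expressed in UTC hours and a two-region split. Real rotations also have to handle weekends, holidays, and handoff overlap, which this ignores.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Shift:
    # Hypothetical shift definition in UTC hours.
    region: str
    start_hour_utc: int  # inclusive
    end_hour_utc: int    # exclusive; may wrap past midnight

    def covers(self, hour: int) -> bool:
        if self.start_hour_utc < self.end_hour_utc:
            return self.start_hour_utc <= hour < self.end_hour_utc
        # Wrapping shift such as 18 -> 6. Equal start and end is treated
        # as covering all 24 hours.
        return hour >= self.start_hour_utc or hour < self.end_hour_utc


def on_call_region(shifts: list[Shift], now: datetime) -> str:
    """Return the region on call at `now`, which must be timezone-aware."""
    hour = now.astimezone(timezone.utc).hour
    for shift in shifts:
        if shift.covers(hour):
            return shift.region
    raise LookupError(f"no shift covers {hour:02d}:00 UTC")


def coverage_gaps(shifts: list[Shift]) -> list[int]:
    """UTC hours that no shift covers; an empty list means full coverage."""
    return [h for h in range(24) if not any(s.covers(h) for s in shifts)]


if __name__ == "__main__":
    shifts = [Shift("europe", 6, 18), Shift("americas", 18, 6)]
    page_time = datetime(2026, 1, 12, 2, 0, tzinfo=timezone.utc)
    print(on_call_region(shifts, page_time), coverage_gaps(shifts))
```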
What we're watching: automated remediation maturity
The next frontier in on-call experience improvement is automated remediation for specific, high-confidence incident classes. Not "automated rollback for all deploys that correlate with alerts" — that's too broad, as we've written about separately. But for specific patterns: a service that's memory-constrained because of a known leak that doesn't affect data integrity if the pod is restarted, a circuit breaker that tripped because of a transient upstream issue, a node that's pressure-stalling and needs to be drained. High-confidence, bounded-impact actions with clear rollback paths.
The engineering challenge is identifying those incident classes and building the confidence-scoring infrastructure to distinguish them reliably from superficially similar situations where automated action would be wrong. That's the problem we're actively working on. The promise isn't "no humans in the loop" — it's "human judgment required only for the decisions that genuinely need it, not for the mechanical actions that don't."
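One way to picture the gating logic is an allowlist of incident classes, each with its own confidence bar, where anything outside the list goes to a human. This is an illustrative sketch and not a description of our implementation. The class names, the action names, and the thresholds are all made up, and the hard part (producing a trustworthy confidence score) is taken as an input.

```python
from dataclasses import dataclass

ESCALATE = "escalate_to_human"


@dataclass(frozen=True)
class RemediationPolicy:
    # Hypothetical policy record for one allowlisted incident class.
    action: str
    min_confidence: float  # assumed per-class threshold
    has_rollback: bool     # only bounded actions with a rollback path qualify


# Example allowlist built from the patterns named above; values are invented.
POLICIES = {
    "known_memory_leak": RemediationPolicy("restart_pod", 0.90, True),
    "transient_circuit_breaker_trip": RemediationPolicy("reset_circuit_breaker", 0.85, True),
    "node_pressure_stall": RemediationPolicy("drain_node", 0.95, True),
}


def decide(
    incident_class: str,
    confidence: float,
    policies: dict[str, RemediationPolicy] = POLICIES,
) -> str:
    """Return the automated action only for allowlisted, high-confidence
    classes with a rollback path; otherwise escalate."""
    policy = policies.get(incident_class)
    if policy is None or not policy.has_rollback:
        return ESCALATE
    if confidence < policy.min_confidence:
        return ESCALATE
    return policy.action


if __name__ == "__main__":
    print(decide("known_memory_leak", 0.95))
    print(decide("deploy_correlated_error_spike", 0.99))  # not allowlisted
```

The default is escalation. An incident class has to be explicitly added to the allowlist before any automated action is possible, which keeps the broad "roll back anything that correlates with an alert" behavior out by construction.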
Being on-call in 2026 is still hard. The tooling has gotten better in meaningful ways. The job itself has gotten more complex in ways that offset some of that progress. The teams that have the best on-call experience right now are the ones who've invested in both the tooling and the rotation design — treating on-call sustainability as an engineering problem rather than a personnel resilience problem.