PagerDuty alert fatigue: causes and solutions

Jordan Kessel July 15, 2025

The clearest sign of alert fatigue in a team is not the number of alerts per week. It's when on-call engineers start developing personal suppression habits: adding a PagerDuty rule to silence a specific alert class "just until the weekend," or keeping a mental list of alerts that can be safely ack'd without investigation. When individual engineers develop their own informal suppression strategies outside the official configuration, your alerting system has effectively failed.

Alert fatigue is a systems problem disguised as a tooling problem. Muting alerts and adjusting PagerDuty escalation policies are tactical responses. They reduce the symptom without addressing the underlying causes: alerts that fire on symptoms rather than actionable conditions, alert volume that makes triage cognitively overwhelming, and, critically, alerts that arrive without the context needed to decide whether they're actually worth waking up for.

The three causes that account for most alert fatigue

Not all noisy alerts are noisy for the same reason. The treatment depends on the cause.

Threshold drift. An alert was written with a threshold that made sense six months ago. Since then, traffic has grown, the service has been refactored, a caching layer was added, or normal operating patterns have shifted. The threshold hasn't been updated. The alert fires two or three times a week on conditions that are in the new normal range for the service. Nobody updates it because updating it requires understanding the current operating range, which takes more time than just ack'ing the alert.

Threshold drift is endemic in growing systems. The mitigation is dynamic thresholds (alert on deviation from recent baseline rather than static values) combined with a quarterly process for reviewing alert thresholds against actual operating ranges. Static thresholds should be the exception rather than the rule for anything that's sensitive to traffic or load patterns.
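To make the idea of alerting on deviation from a recent baseline concrete, here is a minimal sketch. It assumes you can pull a window of recent samples for the metric; the function name, the 3-sigma cutoff, and the sample values are all illustrative choices, not a recommendation from any particular tool.

```python
from statistics import mean, stdev

def exceeds_baseline(history, current, k=3.0, min_points=10):
    """Return True when `current` sits more than k standard deviations
    above the mean of the recent `history` samples."""
    if len(history) < min_points:
        # Too little data to form a baseline; a static fallback could go here.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current > baseline
    return current - baseline > k * spread

# Illustrative latency samples (ms) before and after traffic growth.
old_normal = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
new_normal = [200, 204, 196, 202, 198, 200, 206, 194, 200, 200]

# A static threshold tuned against old_normal (hypothetical value).
STATIC_THRESHOLD_MS = 150
```

With these numbers, a reading of 204 ms is above the static threshold and would page, but it is well inside the new operating range, so the baseline check stays quiet. A reading of 260 ms trips both. A mean and standard deviation is only one way to define "recent baseline"; a median-based variant is more robust to outliers in the history window.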

Symptom alerts without root cause context. An alert fires saying "disk I/O latency elevated on node-07." The on-call engineer acknowledges and investigates. Twenty minutes later they find that node-07 is running hot because a batch job was scheduled without accounting for the concurrent production load. The alert correctly identified a symptom but gave no context about the cause, so the investigation started from scratch.

This category of alert fatigue is where the cost is highest: the alert was legitimate, but the time to resolution is much longer than it needs to be because the alert arrived without root cause evidence attached. The fix isn't better alerting — it's better context delivery at alert time.

Cascade amplification. One underlying problem triggers eight related alerts across dependent services. The on-call engineer gets paged eight times over five minutes for what is fundamentally one incident. They ack one, investigate, and eventually realize the others are downstream effects. In the meantime, the incident commander is trying to coordinate across multiple alert streams that all describe the same root event.

Cascade amplification is the most tractable of the three causes because it's addressable at the alerting layer: alert grouping, deduplication, and parent-child alert relationships. PagerDuty's event intelligence features handle some of this, as do custom event routing rules. The key is identifying which alerts reliably co-occur and building explicit grouping rules for those combinations rather than letting them all propagate independently.
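As a rough illustration of parent-child grouping, the sketch below folds alerts into the alert of an upstream service that fired within the same time window. The dependency map, service names, and five-minute window are hypothetical, and this is not how PagerDuty implements grouping; it only shows the shape of an explicit grouping rule.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    timestamp: float  # seconds since some reference point

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "frontend": ["api"],
    "api": ["cache", "db"],
    "cache": ["db"],
}

def upstream_of(service, depends_on):
    """All services that `service` depends on, directly or transitively."""
    seen, stack = set(), list(depends_on.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen

def group_cascade(alerts, depends_on, window_seconds=300):
    """Return a list of (root_alert, child_alerts) pairs."""
    ordered = sorted(alerts, key=lambda a: a.timestamp)
    upstream = {a.service: upstream_of(a.service, depends_on) for a in ordered}

    def near(a, b):
        return abs(a.timestamp - b.timestamp) <= window_seconds

    def is_root(a):
        # A root has no alerting upstream service inside the window.
        return not any(
            b is not a and b.service in upstream[a.service] and near(a, b)
            for b in ordered
        )

    groups = [(a, []) for a in ordered if is_root(a)]
    for a in ordered:
        if is_root(a):
            continue
        for root, children in groups:
            if root.service in upstream[a.service] and near(a, root):
                children.append(a)
                break
        else:
            # No root in range: surface it on its own rather than hide it.
            groups.append((a, []))
    return groups

alerts = [
    Alert("frontend", 70), Alert("db", 0), Alert("api", 40),
    Alert("cache", 20), Alert("search", 100),
]
groups = group_cascade(alerts, DEPENDS_ON)
```

Here the four alerts on the db, cache, api, and frontend chain collapse into one group rooted at db, while the unrelated search alert stays separate. The fallback branch is a deliberate design choice: when grouping is uncertain, the sketch pages rather than suppresses.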

The limit of silence-based remediation

Most teams' first response to alert fatigue is suppression: higher thresholds, longer evaluation windows, more aggressive deduplication, time-based silences for known maintenance windows. These are legitimate techniques. They're also only half of the solution.

Suppression reduces alert volume. It doesn't reduce the cognitive load of investigating the alerts that do fire. If you have 50 alerts per week and you suppress your way down to 30, you've reduced volume by 40% but you haven't changed how long it takes to investigate each of the 30 remaining alerts. If per-alert investigation is where engineers' time actually goes, volume reduction alone doesn't fix the fatigue.
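A back-of-the-envelope calculation shows why. The 50 and 30 alert counts come from the example above; the 25 and 10 minutes per alert are assumed figures chosen purely for illustration.

```python
def weekly_investigation_minutes(alerts_per_week, minutes_per_alert):
    """Total on-call time spent investigating alerts in one week."""
    return alerts_per_week * minutes_per_alert

# Assumed per-alert investigation times (illustrative, not measured).
MINUTES_WITHOUT_CONTEXT = 25
MINUTES_WITH_CONTEXT = 10

before = weekly_investigation_minutes(50, MINUTES_WITHOUT_CONTEXT)
after_suppression = weekly_investigation_minutes(30, MINUTES_WITHOUT_CONTEXT)
after_context = weekly_investigation_minutes(30, MINUTES_WITH_CONTEXT)
```

Under those assumptions, suppression alone takes the weekly total from 1250 to 750 minutes, and each remaining alert still costs the same 25 minutes. Cutting the per-alert cost is what moves the total to 300.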

The other half of the solution is context delivery: making sure every alert that fires arrives with enough information that the on-call engineer can make a fast triage decision — either "this is worth investigating immediately" or "this is secondary impact from the thing I'm already investigating." Context delivery is about what arrives with the alert, not how many alerts arrive.

What good alert context looks like

The minimum context for any production alert should answer three questions without requiring the engineer to open another tool:

What changed recently in the affected service or its dependencies? A deploy, a config change, a traffic spike, a dependency timeout starting. If the answer is "nothing changed in the last two hours," that tells you something important. If the answer is "order-service v2.4.1 deployed 12 minutes ago," the investigation has a starting point.

Is this alert part of an ongoing incident? If checkout-service is already degraded and this alert is for payments-service, and payments-service calls checkout-service, this alert is almost certainly a downstream effect. That information should be in the alert, not something the engineer has to determine by opening the incident timeline.

What does the anomaly look like relative to baseline? Not just "latency is 450ms" but "latency is 450ms vs. a 7-day p99 baseline of 180ms, representing a 2.5x deviation starting at 14:23 UTC." The baseline comparison is what tells you whether 450ms is catastrophic or mildly elevated.

These three pieces of context can be computed automatically from your existing observability data — deployment events, traces, metrics. The infrastructure to deliver that context with every alert is what makes alert fatigue a solvable problem rather than a chronic condition.
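The three questions above can be sketched as a small enrichment step that runs before the alert is sent. Everything here is an assumption made for illustration: the `ChangeEvent` shape, the `build_context` function, the two-hour lookback, and the idea that payments-service also depends on order-service. In practice the inputs would come from your deploy log, incident tracker, and metrics store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ChangeEvent:
    service: str
    description: str
    at: datetime

def recent_changes(watched, events, now, lookback_hours=2):
    """Change events on watched services inside the lookback window."""
    cutoff = now - timedelta(hours=lookback_hours)
    return [e for e in events if e.service in watched and cutoff <= e.at <= now]

def build_context(service, dependencies, events, degraded,
                  current_ms, baseline_ms, anomaly_start, now,
                  lookback_hours=2):
    """Return one context line per triage question."""
    lines = []

    # 1. What changed recently in the service or its dependencies?
    changes = recent_changes({service, *dependencies}, events, now, lookback_hours)
    for e in changes:
        minutes = int((now - e.at).total_seconds() // 60)
        lines.append(f"Change: {e.service} {e.description} {minutes} minutes ago")
    if not changes:
        lines.append(f"Change: nothing changed in the last {lookback_hours} hours")

    # 2. Is this alert part of an ongoing incident?
    upstream = sorted(set(dependencies) & set(degraded))
    if upstream:
        lines.append(f"Incident: likely downstream of degraded {', '.join(upstream)}")
    else:
        lines.append("Incident: no dependency is currently degraded")

    # 3. What does the anomaly look like relative to baseline?
    ratio = current_ms / baseline_ms
    lines.append(
        f"Baseline: latency is {current_ms:g}ms vs. a 7-day p99 baseline of "
        f"{baseline_ms:g}ms ({ratio:.1f}x), starting at {anomaly_start:%H:%M} UTC"
    )
    return lines

now = datetime(2025, 7, 15, 14, 35, tzinfo=timezone.utc)
events = [
    ChangeEvent("order-service", "v2.4.1 deployed", now - timedelta(minutes=12)),
    ChangeEvent("order-service", "v2.4.0 deployed", now - timedelta(hours=5)),
]
context = build_context(
    service="payments-service",
    dependencies=["checkout-service", "order-service"],  # hypothetical
    events=events,
    degraded={"checkout-service"},
    current_ms=450,
    baseline_ms=180,
    anomaly_start=datetime(2025, 7, 15, 14, 23, tzinfo=timezone.utc),
    now=now,
)
```

The resulting lines can be attached to whatever free-form details field your alerting tool provides. Note that the "nothing changed" case is emitted explicitly, since the absence of a recent change is itself useful triage information.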

The operational review cadence that prevents recurrence

Alert quality is a maintenance problem, not a one-time setup problem. Alerts that are well-calibrated today will drift as systems evolve. A review cadence is the structural commitment that prevents backsliding.

A useful quarterly ritual: for every alert that fired in the last 90 days, ask three questions. Was it actionable — did it result in actual investigation and action? Was it informative — did the alert plus its context give the engineer what they needed? Was it timely — did it fire at the right moment, not too early (before the issue was real) and not too late (after users were already affected)?

Alerts that consistently fail these criteria should be modified or removed. The instinct to keep alerts as "just in case" coverage almost always contributes more to fatigue than to safety. An alert that fires three times a month and results in no action 90% of the time is consuming attention that could be spent on alerts that matter.
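One way to run that review without relying on memory is to tag each firing during the quarter and tally the results. The sketch below assumes a hypothetical `Firing` record with one boolean per question and an arbitrary 50% cutoff; both the record shape and the cutoff are illustrative and should be adapted to how your team records incident outcomes.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Firing:
    alert: str
    actionable: bool
    informative: bool
    timely: bool

CRITERIA = ("actionable", "informative", "timely")

def review(firings, min_rate=0.5):
    """Map each alert to the criteria it met in fewer than min_rate of firings."""
    by_alert = defaultdict(list)
    for firing in firings:
        by_alert[firing.alert].append(firing)

    flagged = {}
    for alert, rows in sorted(by_alert.items()):
        rates = {
            c: sum(getattr(r, c) for r in rows) / len(rows) for c in CRITERIA
        }
        failing = [c for c in CRITERIA if rates[c] < min_rate]
        if failing:
            flagged[alert] = failing
    return flagged

# Illustrative quarter: one alert that led to action once in ten firings,
# and one that was acted on every time.
firings = (
    [Firing("disk-io-latency", i == 0, True, True) for i in range(10)]
    + [Firing("checkout-5xx", True, True, True) for _ in range(4)]
)
flagged = review(firings)
```

In this example only disk-io-latency is flagged, and only on the actionable criterion. The output is a candidate list for the review meeting, not an automatic deletion list.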

We're not saying all ambiguous alerts should be deleted. Some alerts are genuinely investigative: they surface conditions that require human judgment to evaluate. The bar for those should be richer context, not a looser threshold. An alert that fires in truly ambiguous situations has to carry enough evidence for the engineer to make that judgment call quickly.

Where this leaves on-call culture

Alert fatigue has a cultural component that tools alone can't fix. Teams that treat on-call as a necessary evil rather than a skill to invest in tend to accumulate alert debt faster than they can clear it. The investment in good alerting practices — thoughtful thresholds, context delivery, quarterly reviews — requires someone to own it with engineering rigor, not just operational maintenance.

The teams we see handling this best treat their alerting configuration as a product: it has owners, it has quality criteria, and it gets reviewed and iterated on the same way application code does. That's a cultural choice as much as a technical one. The technical tooling exists. The question is whether your team treats alert quality as work worth doing.