A rant about monitoring fatigue

Oftentimes you run a two-tiered alerting setup in your company or team: critical and non-critical. Critical alerts are supposed to wake people up in the middle of the night, while non-critical alerts are expected to inform you asynchronously, usually via email. That system on its own is not a bad thing, but it has a big problem: after a while, non-critical alerts get ignored, and people increasingly accept them and let them clutter up their email inboxes.
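
As a rough sketch of that split, the routing logic boils down to something like the following. This is only an illustration under my own assumptions: page_on_call and send_email are hypothetical stand-ins for whatever paging and mail integrations your team actually uses, not any real API.

    # A minimal sketch of two-tiered alert routing. The severity field and the
    # page_on_call/send_email helpers are hypothetical stand-ins, not a real API.
    from dataclasses import dataclass


    @dataclass
    class Alert:
        name: str
        severity: str  # "critical" or "non-critical"
        message: str


    def page_on_call(alert: Alert) -> None:
        # Placeholder for a real paging integration (e.g. your incident tool's API).
        print(f"PAGING on-call: {alert.name}: {alert.message}")


    def send_email(alert: Alert) -> None:
        # Placeholder for a real mail integration.
        print(f"Emailing the team inbox: {alert.name}: {alert.message}")


    def route(alert: Alert) -> None:
        # Critical alerts wake someone up; everything else lands in the inbox,
        # where it quietly piles up unless somebody actually reviews it.
        if alert.severity == "critical":
            page_on_call(alert)
        else:
            send_email(alert)


    if __name__ == "__main__":
        route(Alert("db-primary-down", "critical", "primary database unreachable"))
        route(Alert("disk-usage-70", "non-critical", "disk usage above 70%"))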

The typical "alerting lifecycle" goes something like this:

  1. We have no alerts for anything. We need alerts.
  2. Now we have alerts. There are too many alerts. They buzz - we ignore them.
  3. We've prioritized our alerting system. From now on only the critical ones wake us up.
  4. We ignore the non-critical alerts completely.

This type of behavior can lead to critical things being classified as non-critical and vice versa. You quickly lose sight of the big picture, or you ignore problems that merely seemed non-critical until they become so critical that they cause an incident.

To counter this, you can try to establish regular review meetings where you go over all alerts with your team. I know that doesn't sound like fun at all. But maybe it prevents a pointless page on one of your on-call nights, or even the next incident that happens only because the "non-critical" alerts were ignored for far too long.

As you go through all your alerts, you can discuss whether the critical ones should stay critical. After that, you can iterate over (at least some of) the non-critical alerts and discuss what the team can do to clear them out. PagerDuty has some thoughts and meeting templates for this kind of review in their knowledge base if you want to dig deeper.

Sources

Team On-Call Handoff Reviews - https://support.pagerduty.com/docs/operational-reviews#team-on-call-handoff-reviews