This article has also been published on Medium
There has been much written about the right way to handle alarms and alerts for Sysadmins, Ops and Reliability Engineers. I take the approach that you can learn as much from looking at how not to do it. Here are some examples. I’m sure readers can think of many more. This is one small part of a big field and doesn’t begin to cover all the other areas that system monitoring and feedback etc. touch on. Neither am I attempting to cover the greater field of Ops, systems administration, Site Reliability Engineering etc in any detail. Lastly, I am talking pretty much about the politics of alerts rather than the technical aspects. The below are principally about out-of-hours ‘on-call’ type alerts but the principles are general:
Fool’s paradise really. ‘Nuff said.
Confusing or Conflating logging with alerting
In short, you always want to have some things logged to give context for when you wind up investigating a problem. You don’t want your logs sent to you as alerts, however. An alert is something that requires action. If it doesn’t require action then it is not an appropriate alert and should not be termed or treated as such. If you must have logging data sent by email, with all the overhead and fuss that that brings, then send it to a logging address; don’t set up a mailing list and subscribe people to it. Subscribing people to your logs is a sure way to get your entire mailing list muted.
Having standing reports and billing sent to the same location
Similar to the above. If you are sending alerts, great; if you are sending billing or weekly reports, great; but don’t send them to the same address or mailing list please, and please don’t subscribe me without asking. I really don’t care about a weekly round-up of up-time by email. If it’s important then I will be monitoring it already; if it isn’t then I don’t want or need it. Either way it’s just spam. If it’s only important to $(manager) and they for some reason have to have an email rather than a web page then great, they can subscribe to it.
Having meaningless alerts
Sounds obvious, but it is so common: people become addicted to the idea that everything can be monitored and set all sorts of alerts on all sorts of things. If it doesn’t require action then it is not an alert and should not be treated as one. All you’re doing is adding to the cognitive load of the people dealing with this stuff and hindering their ability to work. One example of this would be automated status updates – see next item.
Having alerts for ‘everything is OK’
One of the most infuriating alerts. These should not even exist. Great, you want to run a check every ten minutes on something. Happy birthday. I don’t want to be subscribed to a Slack channel where this is posted in real time. As a variant, I don’t want to know that a Jira ticket I did not raise and that did not involve me or my team was opened. Or closed. I. Really. Don’t. Care. Your busy-work is your problem, don’t make it mine.
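A minimal sketch of the alternative, assuming a check that returns a plain status string: remember the last status for each check and only say something when a healthy check goes bad. The class name, the check name `disk` and the `OK`/`FAIL` strings are all made up for illustration and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExceptionOnlyNotifier:
    """Hypothetical helper: remembers the last status of each check."""
    last_status: Dict[str, str] = field(default_factory=dict)

    def observe(self, check: str, status: str) -> Optional[str]:
        # A check we have never seen before is assumed to have been healthy.
        previous = self.last_status.get(check, "OK")
        self.last_status[check] = status
        if status != "OK" and previous == "OK":
            return f"ALERT {check}: {status}"
        # Healthy results, and further non-OK results while already
        # failing, stay silent.
        return None


notifier = ExceptionOnlyNotifier()
# Six runs of the same ten-minute check (invented results).
results = ["OK", "OK", "FAIL", "FAIL", "OK", "OK"]
messages = [m for m in (notifier.observe("disk", s) for s in results) if m]
print(messages)
```

Six check runs produce one message rather than six. The ‘everything is OK’ results still exist, they just belong in a log rather than in somebody’s Slack channel.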
Having inappropriate people ‘monitoring’ alerts
To the layperson an unexpected email, especially one containing any negative terms, is an alert. If they don’t understand it then it will make them anxious. This generates noise. Whether it is the ground crew triaging or management ‘keeping an eye’, if you involve yourself and you’re not part of the solution then you are part of the problem. Playbooks can help to an extent, but only so far and only with staff low enough in the organisation that they are denied any discretion. This goes back to logging, or perhaps more precisely reporting, versus alerting. The classic example of inappropriate people monitoring these things is that they start reacting to issues without context or foresight and (if junior) clamouring for attention or (if more senior) making edicts without consulting the team who actually deal with it. And then they don’t notice when their edict is ignored, but because the issue doesn’t recur immediately they feel good because they ‘dealt with a problem’. Just like the dog barking inside the house to ‘chase away’ the postman and ‘protect the family’ really, but more of a nuisance.
Having inappropriate people setting up the alerts
Yeah we get it, you are so senior, you want access to everything, you want to ‘know everything is ok’ and you have fiscal responsibility. Just don’t set up alerts yourself unless you are going to be acting on that alert yourself. Example: a billing alarm set for 75% quota usage for a 24-hour period. Sounds fine? So what do you suppose happens at the end of the day, say two minutes to midnight? Either you get no alert, in which case you are using far less than you are (expecting to be) paying for (i.e. < 75% of it), or else it’s pointless: you probably won’t actually go over if it’s taken you 23 hours and 58 minutes to get to 75% of your quota, but in any case there isn’t time to do anything about it. Whilst anyone can make a mistake and this one is ‘harmless’, the root cause is someone who isn’t actually in the loop tasking people who are, on their own, without effective consultation or consideration. This sort of stuff is perfect for a brainstorming session; see ‘Reflective Practice’ below.
What? It’s the technical responder’s job to deal with these alerts, not yours? Well this applies at every stage, including when the alert is set up. Repeat after me: “If you’re not capable of responding to the alert then you’re not an appropriate person to set up the alert”- see ‘Assigning all alerts the same priority/urgency’ below.
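To put numbers on the 75% billing alarm above, here is a rough sketch comparing the static threshold with a simple projection of where usage will end up by the end of the period. The linear extrapolation is my own assumption for the sake of the example, as are the function names and the quota of 100 units. It is not how any particular billing service works.

```python
def static_threshold_fires(used: float, quota: float, threshold: float = 0.75) -> bool:
    """The alarm as described: fires once usage reaches 75% of quota."""
    return used >= threshold * quota


def projected_usage(used: float, hours_elapsed: float, period_hours: float = 24.0) -> float:
    """Assumed model: usage keeps growing at the average rate seen so far."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return used * period_hours / hours_elapsed


def projection_fires(used: float, quota: float, hours_elapsed: float,
                     period_hours: float = 24.0) -> bool:
    """Fires only if, at the current rate, the quota would be exceeded."""
    return projected_usage(used, hours_elapsed, period_hours) > quota


QUOTA = 100.0  # arbitrary units, invented for the example

# Two minutes to midnight, 75 units used: the static alarm fires,
# but the projection says you finish the day at about 75.1 units.
late = 23 + 58 / 60
print(static_threshold_fires(75, QUOTA), projection_fires(75, QUOTA, late))

# Six hours in, 40 units used: the static alarm is silent,
# but at this rate you would reach 160 units by midnight.
print(static_threshold_fires(40, QUOTA), projection_fires(40, QUOTA, 6))
```

The second case is the one worth being told about, with eighteen hours left to do something. The first is the one the static alarm actually delivers. Whether a projection like this suits your service is exactly the sort of thing to agree with the people who will be responding to it.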
Having inappropriate people managing alerting channels
Similar to the above. Only people who are actually responsible for responding to alerts for a given team should be setting these up. If you’re a customer service lead and you have customer service alerts, then great, that works. Don’t set the raw technical alerts for the technical team without consultation however. Processing the alerts is part of the responders’ jobs and they don’t need more noise to do so. If you find yourself in this position then speak with your tech people. You would trust them to deal with the problem, so trust them to deal with the alerting. Agree some sort of logging or reporting that you can review at a civilised hour. If you can’t bear to relinquish some of the alerting then great, have those sent just to you and review them with the team in the morning. If you need some stuff to go to tech and some to customer service or whatever then agree some sort of cascade or split – don’t unilaterally include everyone.
Creating hundreds of mailing lists or slack channels for $(Blah) and then subscribing other people to them
Related to the above, having large numbers of slack channels or mailing lists for different categories of alert. Unless you have different teams for different categories of alert then this isn’t helpful. Seriously, do you really think that everyone on the list needs to be aware of every message? All you’re doing is adding to everyone’s workload expecting them to ‘monitor’ (i.e. ignore) all of this. Managers’ busy work is still just busy work.
Having too many comms applications
Every messaging medium is different and has to be installed, managed and monitored separately. If you can’t reach a sensible accommodation regarding your comms channels then you’re adding a huge load to your workforce. Example: Email + Slack + MS Teams + Skype for Business + Yammer + PagerDuty + Internal alerting channel
Assigning all alerts the same priority/urgency
Doesn’t take a genius to realise that this isn’t clever, but again surprisingly common. Perhaps something of a natural consequence of having inappropriate people managing alerts that they are not themselves responding to. The acid test: would you like to have a junior member of staff phone you and wake you at 3 am for a given alert, and if you did, would you like to be the one responsible for getting hold of the technical people to attend to that specific alert? Assuming you got hold of them quickly, would you really want to wait for them to deal with the issue and then debrief before going back to bed? Naturally you had better ensure that you are available, sober and not distracted, e.g. by your kids, your pets, your hobbies etc., whilst you’re ‘on call’. No? Well then that’s not worth alerting people out of hours for, is it? Some alerts can wait until the morning or even until later in the week. Put these on a yellow channel and don’t crowd the emergency channels with them.
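The acid test can be written down as a two-question triage. This is only a sketch: the three urgency levels, the channel names and the example alerts are invented for illustration, and the real answers to both questions should come from the people who respond to the alerts.

```python
from enum import Enum


class Urgency(Enum):
    PAGE_NOW = "emergency"     # worth waking someone at 3 am
    NEXT_WORKING_DAY = "yellow"  # needs action, but can wait until morning
    LOG_ONLY = "log"           # no action needed, so not an alert at all


def classify(requires_action: bool, can_wait_until_morning: bool) -> Urgency:
    """Apply the acid test: does it need action, and does it need it now?"""
    if not requires_action:
        return Urgency.LOG_ONLY
    if can_wait_until_morning:
        return Urgency.NEXT_WORKING_DAY
    return Urgency.PAGE_NOW


# Hypothetical alerts: (requires_action, can_wait_until_morning)
examples = {
    "checkout is returning errors": (True, False),
    "certificate expires in 20 days": (True, True),
    "nightly backup completed": (False, True),
}

routing = {name: classify(*answers).value for name, answers in examples.items()}
for name, channel in routing.items():
    print(f"{channel:>9}: {name}")
```

If everything lands in `emergency` then nobody has actually asked the two questions.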
Having duplicate alerting, e.g. slack and email
Again, sounds really obvious but people love to do it. You would think that they had better things to do, but, yeah. Having an automated alert go to a slack channel that is monitored by people on another slack channel who then contact people on a third slack channel. And then responders are expected to go back through the hierarchy…
Or having an email receipt trigger a Slack message or vice versa. Keep it DRY, people!
Another Byzantine technique, similar to the duplicate alert. Imagine this. You get a phone call from the level 2. You have to check your Slack and there is a human channel with discussion of the issue (fine). It was triggered by an automated alert to a different Slack channel (hmm), which you have to click on to get the details. This takes you to a PagerDuty page (separate login), from where you then have to do some magic juju with the web page (because it hasn’t been formatted properly) to find the link to Splunk/ELK/Whatever (hello different system with different login), where you find it was actually a CloudWatch alarm. So then you have to go to the appropriate AWS account to review an alert for something that is likely to be in a different and unlinked AWS account (and region), again with a different login (because you haven’t heard of AWS Organizations). Then you may be able to start looking at the problem system. So in summary:
phone call > 2 layers of Slack > PagerDuty > Splunk/Elastic > CloudWatch
On what planet does this make sense? It doesn’t really matter whether you are being called out of bed at 3am or have your laptop perched on your knees as your partner drives down the motorway on a Sunday morning with the kids clamouring in the back- this sort of additional load is simply a hindrance. Help us to help you!
A few tips
Like many management problems, the solution is communication, only here it needs to be a two-way street. The Google Site Reliability Engineering playbook is one useful model but if you’re not reviewing your alerts and learning from responding to them then you’re failing. If you wind up in a situation where over a period of three months your guys have over 5000 emails autofiltered to ‘mark read’; ‘move to folder:BS’ that they’ve never read and 17 out of 20 ops Slack channels are muted then you’re doing it wrong. If you aren’t aware of this then you’re asleep at the wheel. If you are aware of it and its scale then ask yourself: what is in those communications and why was it ever thought valuable enough to demand these guys’ attention? What do you think led to a situation where they were ignoring 150 emails and hundreds of Slack messages a day and nothing happened as a result, or even how did they come to be receiving that level of spam? What were you doing at the time? What are you going to do about it now? If you were ignoring these alerts yourself then just think: would you like to be making announcements in person and have everyone ignore you? How do you think that would be in a professional environment? This is like that.
Properly, there should be a single alerting channel with a direct link to the primary alerting, e.g. CloudWatch, that can be escalated, not nested. Not to mention using e.g. AWS Organizations (for AWS).
Whilst you might not want to have a regretrospective postmortem every call-out, it is certainly worth ensuring a periodic review of your alerts and alarms. Ideally frequently enough that the team dealing with them have some recollection of events since the last review. Even if you aren’t doing continuous deployment or agile sprints, the alerts set up at the beginning of a project may no longer be appropriate once it has been deployed for a little while. Outmoded alerts are just noise and eventually the time comes to give them the ‘Old Yeller’ treatment.
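One way to give such a review something concrete to look at is to count, for each alert, how often it fired and how often anybody actually did anything. The sketch below assumes you can export your alert history as (alert name, action taken) pairs. That format, the function name, the `min_fired` cut-off and the sample history are all invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def never_actioned(history: Iterable[Tuple[str, bool]], min_fired: int = 5) -> List[str]:
    """Return alerts that fired at least min_fired times and were never acted on."""
    fired: Counter = Counter()
    actioned: Counter = Counter()
    for name, action_taken in history:
        fired[name] += 1
        if action_taken:
            actioned[name] += 1
    return sorted(
        name for name, count in fired.items()
        if count >= min_fired and actioned[name] == 0
    )


# Made-up history since the last review.
history = (
    [("disk-full", True)] * 2       # fired twice, acted on both times
    + [("cpu-over-80", False)] * 12  # fired twelve times, ignored every time
    + [("cert-expiry", False)] * 1   # fired once, too few to judge
)

candidates = never_actioned(history)
print(candidates)
```

Anything on that list is a candidate for the ‘Old Yeller’ treatment, or at least for a conversation with whoever set it up. The decision itself belongs in the review, not in the script.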
The Golden principles
- Alerts != Logging
- Do not mix alerts with logging or status updates
- Manage by Exception. This is a well-established principle in many formal management techniques, e.g. PRINCE2. It is well-established for a reason. Status updates, ‘situation OK’ reports etc. do not belong with alerts.
- Silence is golden. Seriously if your alerts are silent then you should be golden. This should be your target.
- Restrict the handling of alert configuration, channels etc. to people who are responding to the alerts and know what they are doing. If you are the non-technical big chief then follow your technical people’s lead in this for technical alerts. If you find yourself unable to leave it to them then:
- Ensure a satisfactory explanation for whatever they set up.
- Ensure a suitable logging system so that you can be aware of what alerts triggered, what response was made and when, and so that this can be reviewed during normal working hours.
- Prioritise urgency of alerts
- There are many management models and many overlap but one message from ‘Lean processes’ that applies here in addition to the others above is ‘stop doing things that have no value’. It seems obvious but if you aren’t adding value to a process then anything you do to it is devaluing it.
- Concentrate on solutions rather than problems – You might be a big chief or an established operator but this is not about you or individual heroics, this is about effectively and efficiently managing a service. Remember – “the cemeteries are full of the graves of indispensable men”.