#ITConnections: Increase the Efficacy of Your Alerts with a Strategy that Can be Translated to Remote Work
Relevant and timely alerts are a key tool for managing network performance successfully. However, network administrators are often overwhelmed by alerts. Either they receive too many alerts or get notified of ones that aren’t relevant to them. Join us for this #ITConnections session where we will discuss how to increase the efficiency of your alerts management strategy while having a team that is working remotely.
Before we get started, let’s get on the same page about terminology. The terms Alerts, Alarms and Incidents are often used interchangeably, but they really aren’t the same thing. Let’s identify the difference between them.
Klaas: We see those terms mixed up a lot. Martello iQ, our product, pulls in data from a lot of different integrated systems, and those systems use different terms. For example, some call them “problems” or “events”, and most call them “alerts” or “alarms”.
An Incident is completely different from an event or alert, so those are separate. An Incident comes more from the ITSM world, and an Incident can also be created by a helpdesk.
Alerts most often come from the systems themselves: IT systems like databases and computers generate those alerts.
It’s no secret that alert management can have a huge impact on an IT department. Anyone who has monitored systems for a business or company knows just how many alerts can filter in on a daily basis. I understand that Martello iQ is capable of alert filtering, but what would be a good alarm strategy to begin with? Is there a “cookie cutter” plan built in to iQ already?
Kevin: Well, it is true what they say: a solution is only as good as the data flowing through it! iQ definitely helps and supports teams in lowering the number of alerts and incidents. One of the things it does well is help teams navigate to the root cause of an outage faster, which in turn reduces downtime and improves the user experience. What can teams do? Make sure that their monitoring systems have a good set of rules and thresholds in place, so that the alerts being pushed to iQ are relevant and actionable. Make sure your monitoring systems are properly tuned, and stay in contact with application / service owners. You will need to hear from them when there is maintenance planned, so you can plan for it accordingly. Have a process in place to ensure your critical services are put in a form of Maintenance Mode while they’re being worked on. This will also likely avoid alert storms!
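As a rough illustration of the rules, thresholds and Maintenance Mode idea Kevin describes, here is a minimal Python sketch. It is not how iQ or any particular monitoring system implements this; the `Alert` and `MaintenanceWindow` types, the metric names and the threshold values are all hypothetical and only show the filtering logic.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    service: str
    metric: str
    value: float
    timestamp: datetime


@dataclass
class MaintenanceWindow:
    service: str
    start: datetime
    end: datetime


# Hypothetical thresholds; real values come from tuning with the service owners.
THRESHOLDS = {"cpu_percent": 90.0, "latency_ms": 500.0}


def in_maintenance(alert: Alert, windows: list[MaintenanceWindow]) -> bool:
    """True if the alert's service has a maintenance window covering the alert time."""
    return any(
        w.service == alert.service and w.start <= alert.timestamp <= w.end
        for w in windows
    )


def is_actionable(alert: Alert, windows: list[MaintenanceWindow]) -> bool:
    """Forward an alert only if it breaches a known threshold outside maintenance."""
    threshold = THRESHOLDS.get(alert.metric)
    if threshold is None:
        return False  # no rule defined for this metric, so do not forward it
    return alert.value >= threshold and not in_maintenance(alert, windows)


windows = [MaintenanceWindow("crm", datetime(2024, 5, 1, 22), datetime(2024, 5, 2, 2))]
during = Alert("crm", "cpu_percent", 97.0, datetime(2024, 5, 1, 23))
after = Alert("crm", "cpu_percent", 97.0, datetime(2024, 5, 2, 9))
print(is_actionable(during, windows), is_actionable(after, windows))
```

The same breach is suppressed while the service is in its maintenance window and forwarded once the window has closed, which is the behaviour that helps avoid alert storms during planned work.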
If the alarms are being automatically prioritized by the software, is it possible to run a report of all alarms for the day so I can review any low risk alarms that I need to be made aware of?
Klaas: iQ stores all the information it pulls from those integrated systems in Elasticsearch, and this flexible data store comes with reporting capability, so you can use that to create your reports.
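To make the daily report idea concrete, the sketch below builds an Elasticsearch Query DSL body that selects one day of low-severity alarms. The `bool`, `term`, `range` and `sort` clauses are standard Query DSL, but the field names `severity` and `@timestamp` and the value `"low"` are assumptions for illustration; the actual fields in an iQ deployment may differ.

```python
from datetime import date, datetime, time, timedelta, timezone


def daily_low_severity_query(day: date) -> dict:
    """Build a query body for all low-severity alarms on the given UTC day."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Hypothetical field name and value for alarm severity.
                    {"term": {"severity": "low"}},
                    # Half-open range: from midnight up to, not including, the next midnight.
                    {"range": {"@timestamp": {"gte": start.isoformat(), "lt": end.isoformat()}}},
                ]
            }
        },
        "sort": [{"@timestamp": "asc"}],
    }


query = daily_low_severity_query(date(2024, 5, 1))
print(query["query"]["bool"]["filter"][1])
```

The resulting dictionary can be sent as the body of a search request against whichever index holds the alarm data.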
Kevin: Something that senior and higher-level management always want to see is “what is the health state?” and “how are my services running?” One of the things that iQ does really well is allow you to create very specific perspectives or dashboards for them, so they only get to see what’s relevant for them.
For critical alarm notifications, what are the ways in which iQ will notify my team, and will it notify everyone in the department or specific individuals?
Kevin: iQ has several ways of notifying users when something is affected. Usually the monitoring platforms themselves have ways to notify the teams, but because iQ collects and combines data from all of these sources, centralized notifications are helpful. These can be set up by e-mail through iQ. You can also define that specific users receive only the alerts that are relevant for them, meaning fewer alert e-mail notifications in general and more time spent on resolving issues.
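The idea of sending each alert only to the people it concerns can be sketched as a simple routing table. This is an illustration of the concept, not iQ's configuration format; the categories and e-mail addresses are made up.

```python
# Hypothetical routing table: alert category -> recipients who should be notified.
ROUTES = {
    "database": ["dba-team@example.com"],
    "network": ["netops@example.com"],
}

# Fallback for categories without a dedicated owner.
DEFAULT_RECIPIENTS = ["it-oncall@example.com"]


def recipients_for(alert: dict) -> list[str]:
    """Return only the recipients responsible for this alert's category."""
    return ROUTES.get(alert.get("category"), DEFAULT_RECIPIENTS)


print(recipients_for({"category": "database", "message": "replication lag"}))
print(recipients_for({"category": "printing", "message": "queue stuck"}))
```

With a table like this, the database team is not e-mailed about network alerts, and anything uncategorized still reaches someone.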
Do alarms in iQ let me know if there is an associated SLA that might be impacted?
Kevin: Yes, iQ provides that. Much like the e-mail alerts you can set for your Business Services, or the reports you can run on this data, these can be separated out by the different perspectives. This means you can set up alerts for:
- End-User Experience: Are my users able to use / access the service?
- Application: Are the applications required to run the service healthy?
- Infrastructure: Is the underlying infrastructure still doing its job?
- SLA: Am I still meeting our SLAs this month, or am I at risk of breaching it?
Combined, these give a comprehensive view of the overall health of the customers’ services and ensure teams are focused only on the alerts / incidents that are relevant for them.
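The SLA question above (“am I still meeting our SLAs this month, or am I at risk?”) comes down to simple arithmetic on a downtime budget. The sketch below assumes an availability-based SLA; the 99.9% target, the 30-day month and the 80% warning level are example values, not iQ defaults.

```python
def sla_status(target: float, period_minutes: float, downtime_minutes: float,
               warn_fraction: float = 0.8) -> str:
    """Classify an availability SLA as 'ok', 'at risk' or 'breached'.

    target is the agreed availability (for example 0.999), so the allowed
    downtime budget for the period is (1 - target) * period_minutes.
    """
    budget = (1 - target) * period_minutes
    if downtime_minutes > budget:
        return "breached"
    if downtime_minutes >= warn_fraction * budget:
        return "at risk"  # most of the budget is already used up
    return "ok"


MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month
for downtime in (10, 40, 50):
    print(downtime, sla_status(0.999, MONTH_MINUTES, downtime))
```

At 99.9% over 30 days the budget is about 43 minutes, so 10 minutes of downtime is fine, 40 minutes is at risk and 50 minutes is a breach.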
What are the benefits my entire department will recognize by using iQ to handle my alarms?
Klaas: Teams communicate better because they’re looking at the same tool. That will drive communication between diverse teams.
Kevin: Whenever a certain service goes down, it’s noticed by either the customer or senior management. Then teams start playing the blame game: they look at their own little silo, they can’t find anything, so they say it must be someone else’s problem. This only makes it take longer for organizations to resolve an issue, because it’s hard for anyone to take responsibility. Because iQ is so transparent, you have all your data sources in one place, and all you have to do is check whether one of your boards or one of your services shows a red state. You simply click on that red button to see exactly what the source of the alerts is, and the correct teams will already have been notified by e-mail, if it’s correctly set up. It definitely helps with communication between the teams: less of a finger-pointing game and more of a troubleshooting attitude. Everybody’s working toward the same goal.