Monitoring, Event and Alert

Are you an IT service owner who wants to make sure your service meets customer requirements by detecting events proactively, before they become incidents? Or an IT operations manager who handles an excessive number of events and needs to decide which ones should become alerts? Or a service desk manager who has to decide whether or not to escalate an alert to an incident?

Here is another IT Service Operation topic where the terms are similar and easy to confuse. Let’s define the basics of, and the differences between, the monitoring, event and alert concepts based on ITIL V4. I will begin with the broadest term, monitoring, and continue with event management and alerts, respectively.

Monitoring, Event Management, Alert

First, monitoring is essential for detecting both events and alerts. Monitoring refers to the activity of observing a situation to detect changes that happen over time. Typically, monitoring tools detect a change in the status of a service component or asset. There are different types of monitoring, classified as active or passive and as reactive or proactive. (To keep it simple, I will not go into further detail.)

So your monitoring tool observes the status of key service assets continuously, 24/7. How do you know whether something unexpected has happened or the service is working within acceptable limits? This is where event management comes in: it generates and detects meaningful notifications about the status of an IT service or the infrastructure.

An event is defined as “a change of state that has significance for the management of an IT service.” The essential point is the “significance” of the change. To avoid getting lost in noise, the conditions that generate events should be defined specifically. Remember that these conditions vary with each service asset’s own requirements; there are no fixed rules.
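To make the idea of per-asset conditions concrete, here is a minimal sketch in Python. The asset names, the `cpu_percent` metric and the threshold values are all hypothetical; they only illustrate that each asset carries its own rule and that a reading becomes an event only when it is significant for that asset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Event:
    asset: str
    metric: str
    value: float


# Hypothetical conditions: each (asset, metric) pair has its own rule,
# because what is "significant" differs from one service asset to another.
EVENT_CONDITIONS: Dict[Tuple[str, str], Callable[[float], bool]] = {
    ("db-server-01", "cpu_percent"): lambda v: v > 85.0,
    ("batch-host-02", "cpu_percent"): lambda v: v > 95.0,
}


def detect_event(asset: str, metric: str, value: float) -> Optional[Event]:
    """Return an Event only if the reading is significant for this asset."""
    condition = EVENT_CONDITIONS.get((asset, metric))
    if condition is None or not condition(value):
        return None  # monitored, but nothing worth recording as an event
    return Event(asset, metric, value)
```

With these assumed thresholds, the same 90% CPU reading produces an event on `db-server-01` but not on `batch-host-02`.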

What is the scope of events? In fact, events come in a wide variety. Not every event calls for immediate action. The completion of a scheduled batch job could be an event. A successful login to a system could also be an event. These are “informational events.”

“Warning events,” by contrast, indicate a situation that requires closer monitoring. For instance, a scheduled batch job that takes 10% longer than normal to complete could be a warning event. Another example is a disk log file that comes within 5% of its acceptable capacity level. These are unusual situations in which the correlations related to these events should be reviewed.

The last type is “exception events,” which mean that there is an abnormal situation and action should be taken. A recurrence of earlier warning events might lead to an exception event. For instance, a device’s CPU running above the acceptable utilization rate is an exception event. Events that are suspicious in terms of security could also be of the exception type: an attempt to log in with an incorrect password or the detection of an unauthorized software installation, for example.
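The three event types can be sketched with the batch job example from above. This is only an illustration: the 10% ratio comes from the example in the text, while the recurrence limit of three warnings and the function name are my own assumptions.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


WARNING_RATIO = 1.10   # 10% longer than normal, as in the example in the text
RECURRENCE_LIMIT = 3   # assumption: the third warning in a row becomes an exception


def classify_batch_job(duration_s: float, normal_s: float,
                       prior_warnings: int = 0) -> EventType:
    """Classify the completion of a scheduled batch job as an event type."""
    if duration_s < normal_s * WARNING_RATIO:
        return EventType.INFORMATIONAL      # completed within normal limits
    if prior_warnings + 1 >= RECURRENCE_LIMIT:
        return EventType.EXCEPTION          # recurring warnings need action
    return EventType.WARNING                # unusual, watch more closely
```

A job that normally takes 600 seconds and finishes in 700 would be a warning the first time, and an exception once it has already produced two warnings.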

ALERT

An event should be investigated together with similar and correlated events before deciding whether to escalate it. Before calling it an alert, ask whether human intervention is necessary, because an alert corresponds to notifying the appropriate person to deal with the event. An alert should name a responsible person or team and a specific action to perform on a specific device, ensuring that the event is managed. These conditions should therefore be defined in advance.

Alternatively, an auto-response option could be evaluated before human intervention. Restarting a service or submitting a batch job could be applied automatically, followed by a check of whether the conditions that generated the events have been eliminated.
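The order of these steps could be sketched as follows: try the auto response first, check the condition again, and raise an alert only if the condition is still there. The `Alert` fields mirror the person, device and action mentioned above; the function and parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    recipient: str   # responsible person or team
    device: str      # the specific device the action applies to
    action: str      # the specific action to perform


def handle_event(device: str,
                 condition_still_present: Callable[[], bool],
                 auto_response: Optional[Callable[[], None]],
                 recipient: str,
                 action: str) -> Optional[Alert]:
    """Run the auto response if one exists; alert a human only if it fails."""
    if auto_response is not None:
        auto_response()                      # e.g. restart a service
        if not condition_still_present():
            return None                      # condition eliminated, no alert
    # No auto response, or it did not help: human intervention is needed.
    return Alert(recipient=recipient, device=device, action=action)
```

In this sketch an event with no auto response goes straight to an alert, which is a design choice of the example, not a rule.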

What happens when the event cannot be managed by auto response or human intervention? This might lead to an interruption of a service. In this case the alert could trigger an incident, a change or a problem. (I recommend reading my previous article about the relationship between incidents and problems.)

  • Even when there is no event, monitoring should continue and keep looking for conditions that generate events.
  • An event does not necessarily have an alert. But if there is an alert, there is certainly an event.
  • Not every event requires action.
  • An alert involves human intervention to manage the event.

ITIL Glossary

https://www.axelos.com/Corporate/media/Files/Glossaries/ITIL_2011_Glossary_TR-v1-0.pdf

ITIL Service Operations, 2011, ITIL Official Publisher
