Incident vs Problem based on ITIL 4

As an ITIL expert, I want to discuss the difference between an incident and a problem as defined in ITIL 4. Although both terms are used every day in business life, I sometimes see the two concepts misused or misunderstood. In this short article I discuss the relationship between incidents and problems at a fundamental level.

First, an incident is certainly the more familiar of the two concepts and the easier one to understand. ITIL 4 defines an incident as “an unplanned interruption to an IT service or reduction in the quality of an IT service.” For example, an unplanned outage of a service during working hours counts as an incident. When an incident is raised, the incident management process should ensure that normal service operation is restored as quickly as possible.

Incident vs. Problem

So the incident is resolved by the help desk or by second-level staff, and the service is working again within its service level agreement (SLA). It seems that nothing is wrong. Is that really so? When does a problem arise? Let’s look at the definition. A problem is “a cause of one or more incidents. The cause is not usually known at the time a problem record is created, and the problem management process is responsible for further investigation.” The definition indicates that when the root cause of an incident has not been identified (even though the incident itself may have been temporarily resolved) and the incident is likely to recur, further investigation becomes necessary. In other words, a problem is about the solution, and more precisely the permanent solution.

IT staff are sometimes reluctant to be associated with a “problem”. That is because of the everyday dictionary meaning of the word, not its ITIL meaning. In ITIL terms, becoming part of a problem means becoming part of the investigation and of the solution that prevents the incident from recurring.

Let’s return to the question: when should a problem record be created? There can be different triggers, but let’s first continue with the case of a problem triggered by an incident. After a workaround has resolved the incident, ask “how can we prevent recurrence?” or simply “did we prevent recurrence?” If the answer is no, first check whether the incident is related to an existing problem. If it is, link the incident to the existing problem record; otherwise create a new problem record. This is typical reactive problem management, which takes action after an incident has already occurred.
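The decision flow described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not something ITIL prescribes: the `Incident` and `ProblemRecord` classes, the `suspected_cause` field used to match an incident to an existing problem, the `PRB-n` numbering, and the `handle_resolved_incident` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    incident_id: str
    service: str
    suspected_cause: str          # hypothetical matching key
    recurrence_prevented: bool    # answer to "did we prevent recurrence?"


@dataclass
class ProblemRecord:
    problem_id: str
    suspected_cause: str
    linked_incidents: List[str] = field(default_factory=list)


def handle_resolved_incident(
    incident: Incident, open_problems: List[ProblemRecord]
) -> Optional[ProblemRecord]:
    """Reactive flow: runs after a workaround has resolved the incident."""
    if incident.recurrence_prevented:
        return None  # recurrence is prevented: no problem record needed

    # Is the incident related to an existing problem? If so, link it.
    for problem in open_problems:
        if problem.suspected_cause == incident.suspected_cause:
            problem.linked_incidents.append(incident.incident_id)
            return problem

    # Otherwise create a new problem record.
    new_problem = ProblemRecord(
        problem_id=f"PRB-{len(open_problems) + 1}",
        suspected_cause=incident.suspected_cause,
        linked_incidents=[incident.incident_id],
    )
    open_problems.append(new_problem)
    return new_problem


problems: List[ProblemRecord] = []
first = handle_resolved_incident(
    Incident("INC-1", "mail", "unknown crash", recurrence_prevented=False), problems
)
second = handle_resolved_incident(
    Incident("INC-2", "mail", "unknown crash", recurrence_prevented=False), problems
)
print(first.problem_id, first.linked_incidents)
```

In this sketch the second incident does not open a second problem record; it is linked to the record created for the first one. In practice the matching step is a human judgment rather than a string comparison.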

Besides reactive problem management, proactive problem management can also trigger a problem. Before any incident is raised, reviewing trends in events can point to a future problem. Assessments aimed at improving service quality can also trigger problem records. Although proactive problem management is less common, it is essential for continual improvement.
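As a rough illustration of trend review, the sketch below counts repeated warning events and flags those that recur often enough to deserve a closer look. The event format, the `find_problem_candidates` function, and the threshold of three occurrences are assumptions made for this example only; a real review would also weigh severity and business impact.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_problem_candidates(
    events: Iterable[Tuple[str, str]], threshold: int = 3
) -> List[Tuple[str, str]]:
    """Return (service, event_type) pairs seen at least `threshold` times.

    `events` is a hypothetical stream of (service, event_type) warnings
    collected before any incident has been raised.
    """
    counts = Counter(events)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


events = [
    ("mail", "disk usage above 80%"),
    ("mail", "disk usage above 80%"),
    ("crm", "slow response"),
    ("mail", "disk usage above 80%"),
]
for service, event_type in find_problem_candidates(events):
    print(f"Consider a problem record for {service}: {event_type}")
```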

There are various methods for identifying the root cause of a problem. One of my favorites is “5 Whys.” But that is a subject for another article.

Let’s briefly mention when a problem record should be closed. The problem record should be closed once the final solution has been applied. Defining an action plan without implementing it is not sufficient, because incidents may recur in the meantime. Any incidents that continue to occur should therefore be linked to the problem record.
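The closure rule can be expressed as a simple guard. The sketch below is hypothetical (the `Problem` class and its field names are my own, not taken from any tool): a problem with only an action plan cannot be closed, and incidents that recur in the meantime are linked to it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Problem:
    problem_id: str
    action_plan_defined: bool = False
    solution_implemented: bool = False
    linked_incidents: List[str] = field(default_factory=list)
    closed: bool = False

    def link_incident(self, incident_id: str) -> None:
        """Link an incident that recurred while the problem is still open."""
        self.linked_incidents.append(incident_id)

    def close(self) -> None:
        """Close only once the final solution is actually in place."""
        if not self.solution_implemented:
            raise ValueError("final solution not implemented yet")
        self.closed = True


problem = Problem("PRB-1", action_plan_defined=True)
try:
    problem.close()                   # plan exists, but nothing is implemented
except ValueError as err:
    print(f"Cannot close: {err}")

problem.link_incident("INC-7")        # incident recurred in the meantime
problem.solution_implemented = True
problem.close()
print(problem.closed, problem.linked_incidents)
```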

- The service has an unplanned outage during business hours > incident raised.

- Restarting the service brings it back to normal operation > incident resolved (workaround).

- Ask “did we prevent recurrence?” If the answer is no > create a problem record.

- Investigate further, identify the root cause, and eliminate it > permanent solution implemented.

- Close the problem record > save the solution to the known error database.

  • Not every incident becomes a problem >> If the cause of the incident is known, a problem record is not necessary.
  • Not every problem has a preceding incident >> Proactive problem management can trigger problems.
