Incidents are unavoidable, and each organization faces different risks from different types of incidents. Managing these risks to a level that management accepts requires balancing proactive activities that minimize incidents in the first place, such as Change and Problem Management, with the ability to respond to and resolve incidents effectively and efficiently.
In ITIL v3 there are two concepts that groups seeking to improve their processes should become familiar with: the Expanded Incident Lifecycle and the Event Management process.
Expanded incident lifecycle
In the ITIL Service Design book’s section on the Availability Management process there is a great discussion on the expanded incident lifecycle (EIL) and the need to understand it in order to improve availability. In essence, the EIL takes an incident and breaks it down into phases that can be formally planned for as part of service design, monitored and improved over time:
Occur – While not exactly a phase, it is worthwhile to recognize that incidents happen and that steps should be taken to pragmatically reduce their occurrence.
Detect – After an incident starts, the next step is to detect it. Ideally, detection comes from a monitoring system, or at least from within IT, but the incident may go unnoticed until customers and users detect it first, which is far from ideal. The gap between occurrence and detection is often one of the longest durations and offers opportunities for improvement.
Diagnose – The next step is to determine what is wrong. The first question to ask should always be “What changed?” because we know that 80% of availability incidents stem from a change in the environment, and far fewer from hardware failure, a facilities issue and so on. Any type of change to a given CI, or to a related CI in the service, should be ruled in or out as quickly as possible to speed diagnosis.
Repair – Following diagnosis come the activities associated with repairing the configuration item (CI) that failed. Hardware may need to be ordered, vendors contacted, consultants brought in, and so forth. The biggest gap here is understanding how a given CI was configured. Groups with an accurate configuration management system (CMS) know right away, whereas others must perform forensic archaeology to work it out, losing valuable time in the process.
Recover – Once the CI is repaired, it must be brought back online, including reloading any necessary images, applications and/or data. Again, rapid, accurate knowledge about CIs will speed this up, as will restoring from standard builds or images rather than building a unique system from scratch.
Restore – The last step is the restoration of the service. It may be that related CIs must be rebooted in a certain order to re-establish connectivity, and so on. Service design documentation and/or standard operating procedures that are readily accessible and accurate will aid groups restoring services.
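The “What changed?” question from the Diagnose phase lends itself to a quick lookup. The following is a minimal Python sketch, assuming change records are available as simple objects. The `ChangeRecord` fields, the `recent_changes` function and the seven-day lookback window are hypothetical illustrations, not part of ITIL or of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical, simplified change record."""
    change_id: str
    ci: str                    # name of the configuration item that was changed
    implemented_at: datetime   # when the change went live


def recent_changes(changes, failed_ci, related_cis, detected_at,
                   lookback=timedelta(days=7)):
    """Return changes to the failed CI or its related CIs that were
    implemented within the lookback window, newest first."""
    in_scope = {failed_ci, *related_cis}
    window_start = detected_at - lookback
    hits = [
        c for c in changes
        if c.ci in in_scope and window_start <= c.implemented_at <= detected_at
    ]
    # The most recent change is usually the first one to rule in or out.
    return sorted(hits, key=lambda c: c.implemented_at, reverse=True)
```

An empty result does not prove that nothing changed; it only shows that no recorded change falls in the window, which is one reason accurate change and configuration records matter.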
A correctly implemented Incident Management process is a prerequisite. With that in place, the EIL is a powerful tool for planning future incident response and for reviewing past incidents when looking for process improvement opportunities.
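When reviewing an incident, the time spent between phases can be computed from timestamps. This is a rough Python sketch, assuming one timestamp is recorded per phase (when the incident occurred, was detected, was diagnosed, and so on). The function names and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta

# Phase order as listed in the expanded incident lifecycle.
PHASES = ["occur", "detect", "diagnose", "repair", "recover", "restore"]


def phase_durations(timestamps):
    """Map each gap between consecutive phases to its duration.

    `timestamps` maps a phase name to the moment that phase's
    milestone was reached.
    """
    durations = {}
    for prev, curr in zip(PHASES, PHASES[1:]):
        delta = timestamps[curr] - timestamps[prev]
        if delta < timedelta(0):
            raise ValueError(f"{curr} is recorded before {prev}")
        durations[f"{prev}->{curr}"] = delta
    return durations


def longest_gap(timestamps):
    """Return the name of the gap that consumed the most time."""
    durations = phase_durations(timestamps)
    return max(durations, key=durations.get)


def time_to_restore(timestamps):
    """Total elapsed time from occurrence to service restoration."""
    return sum(phase_durations(timestamps).values(), timedelta())


# Illustrative incident: all times are invented.
incident = {
    "occur": datetime(2024, 1, 10, 2, 0),
    "detect": datetime(2024, 1, 10, 2, 45),
    "diagnose": datetime(2024, 1, 10, 3, 5),
    "repair": datetime(2024, 1, 10, 3, 35),
    "recover": datetime(2024, 1, 10, 3, 50),
    "restore": datetime(2024, 1, 10, 4, 0),
}
print(longest_gap(incident))      # occur->detect
print(time_to_restore(incident))  # 2:00:00
```

In this made-up incident, the gap between occurrence and detection is the largest, so detection is where a review would look first.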
Event management
The Event Management process is codified in the ITIL Service Operation book and, when coupled with the EIL perspective, can help reduce incident response times and improve the mean time to restore service (MTRS). Even though it is documented in the Service Operation book, planning for events is a formal process that begins in Service Design.
From a basic perspective, an event is a change in state. The Event Management process brings together the stakeholders of a service and asks “What do we need to monitor in order to manage this service?” The outcomes are then built into the design of the service, including its documentation. This is a dramatic improvement over the all-too-common approach of figuring out what to look for, how to monitor it and how to respond on an ad hoc, often post-incident, basis.
As a result of formal planning, monitoring is built into the design and tested. This avoids situations in which the very addition of monitoring causes problems in production. It also helps avoid the protracted learning curve that results when operations receives a service with no idea what to monitor or how to respond.
This dual nature is important: Event Management not only involves the detection of events (through whatever means are relevant including observation) but also how to respond. When supported with the right tools, tremendous improvements in effectiveness and efficiency become possible through centralized logging, analysis and response.
There are three base types of events:
- Informational – The fact that the event occurred is recorded and nothing more is done, for example when a batch job completes successfully. The data is captured because it may be useful at some later time.
- Warning – These events are processed by an event correlation system and could result in some combination of logging, automatic response, and operator alerts.
- Exception – These are the most serious events and would result in the tool opening the appropriate incident, problem and change records and triggering those processes to begin.
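The three event types can be pictured as a simple classify-then-route step. The Python sketch below is only an illustration: the memory-utilization thresholds, the function names and the action labels are assumptions chosen for the example, and a real event correlation tool would apply far richer rules.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# Hypothetical thresholds for memory utilization, in percent.
WARNING_THRESHOLD = 80.0
EXCEPTION_THRESHOLD = 95.0


def classify_memory_event(utilization_pct):
    """Classify a memory reading against the assumed thresholds."""
    if utilization_pct >= EXCEPTION_THRESHOLD:
        return EventType.EXCEPTION
    if utilization_pct >= WARNING_THRESHOLD:
        return EventType.WARNING
    return EventType.INFORMATIONAL


def route(event_type):
    """Return the follow-up actions for an event type as plain labels."""
    if event_type is EventType.INFORMATIONAL:
        # Recorded for possible later use; nothing else happens.
        return ["record"]
    if event_type is EventType.WARNING:
        # The correlation step decides on logging, automatic
        # response and/or operator alerts.
        return ["correlate"]
    # Exception: open the appropriate records and start those processes.
    return ["open_records", "trigger_processes"]
```

Keeping classification separate from routing means the thresholds can be agreed on during service design without touching the response logic.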
The benefits are very real. Imagine an event management tool detecting that a memory threshold is being reached on a CI and determining that it is okay to reboot the server at 3 AM local time. The event tool then waits until 3 AM, opens both an incident record and a change record, reboots the server, verifies that the CI is recovered and the service(s) are restored, and then closes both records. This would lessen the load on staff while also improving service levels. Granted, not all incidents can be resolved automatically, but it is just as important to identify what to look for and how to respond so that people performing Incident Management roles can respond correctly.
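The automated reboot scenario can be sketched in a few lines of Python. This is a hedged illustration rather than how any specific event management tool works: `next_maintenance_time`, `remediate` and the four callables passed to it are hypothetical stand-ins for whatever the tool and the service desk system actually provide.

```python
from datetime import datetime, time, timedelta


def next_maintenance_time(now, window_start=time(3, 0)):
    """Return the next occurrence of the maintenance time (3 AM by default).

    If `now` is exactly at or past today's window start, the next
    day's window is returned.
    """
    candidate = datetime.combine(now.date(), window_start, tzinfo=now.tzinfo)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def remediate(ci, open_record, reboot, verify, close_record):
    """Run the steps from the scenario once the maintenance time arrives.

    All four callables are assumed to be supplied by the caller:
      open_record(kind, ci) -> record id
      reboot(ci)
      verify(ci) -> True if the CI is recovered and the service restored
      close_record(record_id)
    """
    incident_id = open_record("incident", ci)
    change_id = open_record("change", ci)
    reboot(ci)
    if not verify(ci):
        # Leave both records open so staff can pick the incident up.
        return False
    close_record(incident_id)
    close_record(change_id)
    return True
```

Records are closed only after verification succeeds, so a failed reboot leaves an open incident for a person to handle instead of being silently marked as resolved.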
IT organizations will always be under intense pressure to meet service levels while controlling costs. The EIL and the Event Management process are documented in ITIL and serve as references for organizations looking for ideas on how to improve. Management can take these ideas and work on improving processes that, in turn, improve service and advance the organization's goals.
George Spafford is an experienced consultant in business and IT operations. He is a prolific author and speaker, and has consulted and conducted training on strategy, IT management, information security and overall process improvement in the U.S., Canada, Australia, New Zealand and China. His publications include hundreds of articles plus the following books: “The Governance of Green IT”, “Greening the Data Center”, “Optimizing the Cooling in Your Existing Data Center” and co-authorship of “The Visible Ops Handbook” and “Visible Ops Security”. His current area of research is the use of information technology for competitive advantage.