How to Improve Incident Response Times

Incidents are unavoidable, and each organization differs in the risks associated with various types of incidents. Managing these risks in a manner acceptable to management requires a balance between proactive activities that minimize incidents in the first place, such as Change and Problem Management, and the ability to respond to and resolve incidents effectively and efficiently.

In ITIL v3 there are two concepts that groups seeking to improve their processes should become familiar with: the Expanded Incident Lifecycle and the Event Management process.

Expanded incident lifecycle

In the ITIL Service Design book’s section on the Availability Management process there is a great discussion on the expanded incident lifecycle (EIL) and the need to understand it in order to improve availability. In essence, the EIL takes an incident and breaks it down into phases that can be formally planned for as part of service design, monitored and improved over time:

Occur – While not exactly a phase, it is worthwhile to recognize that incidents happen and that steps should be taken to pragmatically reduce their occurrence.

Detect – After an incident starts, the next step is to detect it. Ideally, detection comes from a monitoring system, or at least from within IT, but the incident could slip out and be detected first by customers and users, which is far from ideal. Often, the gap between occurrence and detection is one of the longest durations and offers opportunities for improvement.

Diagnose – The next step is to determine what is wrong. The first question to ask should always be “What changed?” because we know that 80% of availability incidents will stem from a change in the environment and far fewer from hardware failure, a facilities issue, etc. Any type of change in a given configuration item (CI), or a related CI in the service, should be ruled in, or out, as quickly as possible to speed diagnosis.

Repair – Following diagnosis are the activities associated with repairing the configuration item (CI) that failed. Hardware may need to be ordered, vendors contacted, consultants brought in, and so forth. The biggest gap here is understanding how a given CI was configured. Groups with an accurate configuration management system (CMS) know right away, whereas others will need to perform forensic archeology to determine that, losing valuable time in the process.

Recover – Once the CI is repaired, it must be brought back online, including reloading any necessary images, applications and/or data. Again, rapid, accurate knowledge about CIs will speed this up, as will having standard builds/images to restore from rather than building a unique system from scratch.

Restore – The last step is the restoration of the service. It may be that related CIs must be rebooted in a certain order to re-establish connectivity, and so on. Service design documentation and/or standard operating procedures that are readily accessible and accurate will aid groups restoring services.
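The “What changed?” question from the Diagnose phase can be sketched as a simple filter over change records. This is a minimal illustration, not an ITIL-defined interface: the ChangeRecord fields, the candidate_changes function and the seven-day lookback window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    """Hypothetical change record; real tools will carry many more fields."""
    change_id: str
    ci: str
    implemented_at: datetime


def candidate_changes(changes, failed_ci, related_cis, incident_start,
                      lookback=timedelta(days=7)):
    """Return recent changes to the failed CI or a related CI, newest first.

    The lookback window is an assumed default, not a value from ITIL.
    """
    in_scope = {failed_ci, *related_cis}
    window_start = incident_start - lookback
    hits = [
        c for c in changes
        if c.ci in in_scope and window_start <= c.implemented_at <= incident_start
    ]
    # Newest changes are usually the first ones worth ruling in or out.
    return sorted(hits, key=lambda c: c.implemented_at, reverse=True)
```

Sorting newest first reflects the idea that the most recent change to the failed CI, or to a CI it depends on, is the quickest one to rule in or out.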

The Incident Management process must be implemented correctly as a prerequisite. With that in place, the EIL is a powerful tool for planning future incident response and for reviewing past incidents when looking for process improvement opportunities.
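One way to use the EIL when reviewing an incident is to record a timestamp at the end of each phase and see where the time went. The sketch below assumes those timestamps are available; the IncidentTimeline class, its field names and the sample times are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentTimeline:
    """Timestamps marking the end of each expanded incident lifecycle phase."""
    occurred: datetime
    detected: datetime
    diagnosed: datetime
    repaired: datetime
    recovered: datetime
    restored: datetime

    def phase_durations(self):
        """Return the time spent in each phase, in lifecycle order."""
        marks = [
            ("detect", self.occurred, self.detected),
            ("diagnose", self.detected, self.diagnosed),
            ("repair", self.diagnosed, self.repaired),
            ("recover", self.repaired, self.recovered),
            ("restore", self.recovered, self.restored),
        ]
        return {name: end - start for name, start, end in marks}

    def longest_phase(self):
        """Return the name of the phase that consumed the most time."""
        durations = self.phase_durations()
        return max(durations, key=durations.get)


# Made-up sample incident: 45 minutes pass before anyone notices it.
timeline = IncidentTimeline(
    occurred=datetime(2024, 1, 1, 9, 0),
    detected=datetime(2024, 1, 1, 9, 45),
    diagnosed=datetime(2024, 1, 1, 10, 5),
    repaired=datetime(2024, 1, 1, 10, 35),
    recovered=datetime(2024, 1, 1, 10, 50),
    restored=datetime(2024, 1, 1, 11, 0),
)
print(timeline.longest_phase())  # prints "detect"
```

Summing the phase durations gives the total time to restore service for the incident, so the same records can feed both per-phase reviews and an overall MTRS figure.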

Event management

This process is codified in the ITIL Service Operation book and, when coupled with the EIL perspective, can help reduce incident response times and improve the mean time to restore service (MTRS). Even though it is in the Service Operation book, the planning for events begins in Service Design and is a formal process.

From a basic perspective, an event is a change in state. The Event Management process brings together the stakeholders of a service and asks “What do we need to monitor in order to manage this service?” The outcomes are then built into the design of the service, including the documentation. This is a dramatic improvement over the all too common approach of figuring out what to look for, how to monitor and then how to respond on an ad hoc, often post-incident, basis.

As a result of formal planning, monitoring is built into the design and tested. This avoids situations in which the very addition of monitoring causes problems in production. It also helps avoid the protracted learning curves where operations receives a service and has no idea what to monitor or how to respond.

This dual nature is important: Event Management not only involves the detection of events (through whatever means are relevant including observation) but also how to respond. When supported with the right tools, tremendous improvements in effectiveness and efficiency become possible through centralized logging, analysis and response.

There are three base types of events:

  • Informational – The fact that the event occurred is recorded and nothing more is done, for example that a batch job completed successfully. This data may be useful at some later time, so it is captured.
  • Warning – These events are processed by an event correlation system and could result in some combination of logging, automatic response, and operator alerts.
  • Exceptions – These are the most serious events and would result in the tool opening the appropriate incident, problem and change records and triggering those processes to begin.
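The three event types above can be sketched as a small dispatch table. The EventType enum, the ACTIONS mapping and the action names are illustrative assumptions; a real event management tool will define its own categories and responses.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# Hypothetical action names, one entry per event type described above.
ACTIONS = {
    # Only the fact that the event happened is kept.
    EventType.INFORMATIONAL: ["record event"],
    # The correlation system decides on logging, automatic response or alerts.
    EventType.WARNING: ["send to event correlation"],
    # The most serious events start the formal processes.
    EventType.EXCEPTION: ["open appropriate records", "trigger processes"],
}


def route_event(event_type):
    """Return the follow-up actions for an event of the given type."""
    return list(ACTIONS[event_type])


print(route_event(EventType.WARNING))
```

Keeping the mapping in one table makes the agreed response for each event type explicit, which fits the point that responses should be designed up front rather than worked out after an incident.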

The benefits are very real. Imagine an event management tool detecting that a memory threshold is being reached on a CI and that it is okay to reboot the server at 3 AM local time. The event tool then waits until 3 AM, opens both an incident and a change record, reboots the server, verifies that the CI is recovered and the service(s) are restored, and then closes both records. This would lessen the load on staff while also improving service levels. Granted, not all incidents can be resolved automatically, but it is just as important to identify what to look for and how to respond so people performing Incident Management roles can respond correctly.
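The 3 AM reboot scenario can be sketched in two parts: working out when the next permitted window starts, and running the steps in order while keeping an audit trail. Everything here is a hypothetical illustration. The function names, the callables passed in and the “escalate to operator” fallback are assumptions, and the window calculation ignores daylight saving edge cases.

```python
from datetime import datetime, time, timedelta


def next_window(now, window_start=time(3, 0)):
    """Return the next occurrence of the window start strictly after `now`."""
    candidate = datetime.combine(now.date(), window_start, tzinfo=now.tzinfo)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def automated_reboot(reboot, ci_recovered, service_restored):
    """Run the reboot scenario and return an audit trail of the steps taken.

    `reboot`, `ci_recovered` and `service_restored` are caller-supplied
    callables standing in for whatever the real tooling provides.
    """
    trail = ["open incident record", "open change record"]
    reboot()
    trail.append("reboot server")
    if ci_recovered() and service_restored():
        trail += ["close incident record", "close change record"]
    else:
        # Assumed fallback: hand over to a person if verification fails.
        trail.append("escalate to operator")
    return trail


print(next_window(datetime(2024, 5, 1, 22, 0)))
print(automated_reboot(lambda: None, lambda: True, lambda: True))
```

Verifying both the CI and the service before closing the records mirrors the distinction between the Recover and Restore phases: a server that is back up does not by itself mean the service is restored.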

IT organizations will always be under intense pressure to meet service levels while controlling costs. The EIL and Event Management process are documented in ITIL and serve as references for organizations looking for ideas on how to improve. Management can take these ideas and work on improving processes that, in turn, improve service and the pursuit of goals.

George Spafford is an experienced consultant in business and IT operations. He is a prolific author and speaker, and has consulted and conducted training on strategy, IT management, information security and overall process improvement in the U.S., Canada, Australia, New Zealand and China. His publications include hundreds of articles plus the following books: “The Governance of Green IT”, “Greening the Data Center”, “Optimizing the Cooling in Your Existing Data Center” and co-authorship of “The Visible Ops Handbook” and “Visible Ops Security”. His current area of research is the use of information technology for competitive advantage.

How to Get Started with ITIL

Six steps are all you need to get Incident Management under control, writes ITSMWatch columnist Hank Marquis of Global Knowledge.

ITIL spends lots of time on the things you could do but little time on the basics required to get started. This leads to lots of questions about where to start, understanding how much you are already doing, and even when to stop and move on to the next process or activity.

ITIL v3 defines an Incident as an unplanned interruption to service, or a reduction in the quality of a service. Incident Management (IM) is the term used to describe the process responsible for managing incidents. The main goal of IM is to restore service to users as quickly as possible. When you handle a call or email from a customer or user experiencing something wrong with a service you provide, you are “doing” IM. ITIL explains the reasons for Incident Management and how to set up an Incident Management process.

There are many aspects to carrying out IM (explained in detail within ITIL), but there are just six things you need to do. If you do these six things consistently, then there is a very good chance you are “doing” IM well enough to achieve the benefits ITIL describes:

1. Create and maintain a database for all IM records. Make sure every Incident (each and every call, email, fax or drop-by) gets recorded in the database, along with the important information about it (work done, resolution, notes, etc.).

2. Create a knowledge base by capturing and providing access to other records, data and information as well — as much information as required to perform the IM function. ITIL calls this a configuration management database (CMDB) and/or a configuration management system (CMS).

3. Make sure you have (and enforce) written procedures for recording, prioritizing, classifying, escalating and reviewing Incidents.

4. Put in place procedures to manage the impact (effect of an Incident) on customer business based on how defined service levels will be affected.

5. Create a “major incident” model — a set of rules that clearly describe a very bad incident. Major incidents affect important services and/or numerous customers and users. In either case, a major incident requires immediate escalation, notification of customers and IT management and other special handling. Basically the concept is that this incident type needs a full organizational response as soon as possible.

6. Keep the reporter of the incident (that is, whoever contacted you about the service issue) informed about the status of the incident as your team works it. You might also have to keep the customer informed (users work for customers). Be sure to know in advance whom to keep updated and how often to update them. You may also need to advise them if it appears that you won't be able to get service back within agreed parameters and agreed time frames.
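The “major incident” model in step 5 is essentially a set of written rules, so it can be sketched as a small predicate. The Incident fields, the list of important services and the user threshold below are made-up placeholders; each organization would set its own rules.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record reduced to the fields the rules need."""
    service: str
    affected_users: int


# Placeholder rule values; agree on the real ones with the business.
IMPORTANT_SERVICES = {"payments", "email"}
MAJOR_USER_THRESHOLD = 100


def is_major_incident(incident):
    """A major incident hits an important service and/or numerous users."""
    return (
        incident.service in IMPORTANT_SERVICES
        or incident.affected_users >= MAJOR_USER_THRESHOLD
    )


def response_actions(incident):
    """Return the handling steps implied by the major incident model."""
    if is_major_incident(incident):
        return [
            "escalate immediately",
            "notify customers",
            "notify IT management",
        ]
    return ["follow standard incident procedure"]


print(response_actions(Incident(service="payments", affected_users=3)))
```

Writing the rules down in this form makes them easy to review and test, so that whoever records the Incident reaches the same conclusion about whether it needs a full organizational response.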

According to the International Standard for Information Technology — Service Management (ISO/IEC 20000-1 or ISO20K for short) if you are not doing even one of the above items, you can improve your service quality by focusing on doing it. If you are already doing these things consistently, then you might not need to spend more time in this area — focus on another ITIL area instead, perhaps Problem Management or Change Management.

These six items might take you months to implement and tune, but the result will be a tremendous improvement. Use these six steps to guide your ITIL implementation, and use the guidance in ITIL if you have questions about how to do these steps.

Do note that there is no need to use ITIL for these activities! In fact, you can “grow your own” solution or use any of the other IT service management frameworks out there since ITIL isn’t the only service management practice available. Of course, ITIL is the de facto standard worldwide and is arguably the most mature collection of guidance for enterprise IT organizations.

When you have mastered these six steps you may or may not be “done”. These steps are the minimum number of activities you should have and do well to deliver the benefits described in ITIL.