Major Work Behind the Major Incident Process

Marvin Kirkendoll, UIT’s Service Monitoring Systems Lead and Major Incident Process Owner

What’s a critical service?

A critical service is a technology service that many of us rely on to do our work. It requires timely communication during emergency situations.

  • Mission critical services are Priority 1 (P1) and require continuous availability, e.g., email, Oracle Financials, PeopleSoft, Zoom. 
  • Business critical services are Priority 2 (P2) and require continuous availability for effective business operations, e.g., Cardinal Print, Slack Grid, Jira. 
  • Business operational services are Priority 3 (P3) and contribute to efficient business operations, e.g., Authority Manager, Smartsheet, Tableau. These are not considered critical services; however, if they experience widespread impact, these P3 incidents can be escalated and entered into the major incident process. 

Although outages are not a regular occurrence, critical technology services are bound to fall short from time to time, and it's our responsibility as technologists to ensure that these services run as smoothly as possible.

When an incident escalates to a major incident (that is, one or more critical services experience a widespread outage, degradation, or localized impact), the University IT (UIT) major incident process is activated immediately. The process, a vital part of IT service management (ITSM) at Stanford, aims to quickly triage, communicate about, and restore the affected service(s).

Major Incident Process Overview

Take a high-level look at these five steps that make up the major incident process.

  1. Detect, record, and propose. Incidents are detected by end users, UIT internal staff, or monitoring systems. The incidents are prioritized as P1 (priority 1) or P2 (priority 2) depending on the outage scope (widespread or local impact) and criticality. P1 and P2 incidents are automatically assigned to the IT Operations Center (ITOC).
  2. Triage and validate. ITOC validates the major incident, and based on established criteria (see table below), the Department Operations Center (DOC) may be activated. If there is an impact on Healthcare, ITOC notifies the ITI Emergency Response (I-TIER).
  3. Promote and assign. Once the incident is promoted to a major incident (with the DOC activated if necessary), the bridge and communications pathways are also activated for the affected service. ITOC facilitates assigning tasks to the appropriate support groups or subject matter experts (SMEs) for investigation. ITOC also creates an outage record, posts outages on the Service Alerts portal, and sends client service alert notifications.
  4. Investigate and restore. Support groups and SMEs investigate the cause of the major incident and issue time-based escalation and service alert notification updates. For non-healthcare services, ITOC updates clients hourly for P1 incidents and every two hours for P2 incidents. Healthcare service notifications are sent every 30 minutes. Once the service is restored, the groups complete incident tasks, resolve the outage record and send client resolution notifications.
  5. Evaluate, document, review. ITOC creates a problem record in ServiceNow, and SMEs and the IT Resilience team then evaluate the cause of the major incident. The problem record remains open while the teams investigate and document the root cause and permanent remediation. The SMEs submit a change request for permanent fixes as appropriate. ITOC and SMEs document information in a knowledge article for future reference. Finally, for major incidents, ITOC coordinates and conducts an “After Action Review” for continuous improvement. As owner of the problem management process, the IT Resilience group coordinates and conducts the “Root Cause and Corrective Action Review” for all major incidents and DOC events.
Marty Dart, ITOC Technical Lead

Dive deeper into the roles, responsibilities, and details for each step in this UIT Major Incident Process in Practice document.
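
As an illustration of the update cadence described in step 4, here is a minimal Python sketch. The intervals come from the text above; the function and constant names are hypothetical and are not part of any UIT or ServiceNow tooling.

```python
from datetime import datetime, timedelta

# Client update intervals for non-healthcare services, as stated in step 4.
UPDATE_INTERVALS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=2),
}

# Healthcare service notifications are sent every 30 minutes.
HEALTHCARE_INTERVAL = timedelta(minutes=30)


def update_interval(priority: str, healthcare: bool = False) -> timedelta:
    """Return how often clients should be updated (hypothetical helper)."""
    if healthcare:
        return HEALTHCARE_INTERVAL
    return UPDATE_INTERVALS[priority]


def next_update_due(last_update: datetime, priority: str,
                    healthcare: bool = False) -> datetime:
    """Return the time at which the next client update is due."""
    return last_update + update_interval(priority, healthcare)
```

For example, under these assumptions a P2 incident last updated at 9:00 would be due for its next client update at 11:00, while a healthcare service would be due at 9:30.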

Major Incident vs. DOC

As the major incident owner, ITOC determines when to activate the DOC. This occurs when there is widespread impact to a service that supports multiple critical services. The following table shows, for P1 incidents of increasing scale, whether the major incident process and the DOC are activated.

P1 Incident Activation Criteria | Major Incident | DOC
Loss of redundancy for a single critical service offering | No | No
Service degradation of a single critical service offering | Yes | No
Service outage of a single critical service offering | Yes | No
Service outage of a single service offering that supports multiple critical service offerings | Yes | Yes
Service outage of multiple critical service offerings | Yes | Yes
Major or critical information security incident | Yes | Yes
Utility (electricity, chilled water) degradation or outage in a technical facility (Forsythe, ECHs) | Yes | Yes
Utility (electricity, chilled water) degradation or outage affecting multiple buildings on campus | Yes | Yes

Note: The UIT DOC plan is currently in the process of being reviewed and updated. 
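
The P1 activation criteria can also be read as a simple lookup table. The sketch below encodes the rows as published here; the short keys and the function name are hypothetical labels chosen for illustration, and the values may change as the DOC plan is revised.

```python
from typing import NamedTuple


class Activation(NamedTuple):
    major_incident: bool
    doc: bool


# One entry per row of the P1 activation criteria table.
# The keys are hypothetical short names, not official identifiers.
ACTIVATION_CRITERIA = {
    "redundancy_loss": Activation(False, False),
    "single_offering_degradation": Activation(True, False),
    "single_offering_outage": Activation(True, False),
    "shared_offering_outage": Activation(True, True),
    "multiple_offering_outage": Activation(True, True),
    "security_incident": Activation(True, True),
    "utility_issue_technical_facility": Activation(True, True),
    "utility_issue_multiple_buildings": Activation(True, True),
}


def activation_for(criterion: str) -> Activation:
    """Look up whether a criterion activates the major incident process and the DOC."""
    return ACTIVATION_CRITERIA[criterion]
```

Under this encoding, `activation_for("single_offering_outage")` reports that the major incident process is activated but the DOC is not.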

Questions about the process

For questions about the major incident process, reach out by email or Slack to UIT’s Service Monitoring Systems Lead and Major Incident Process Owner Marvin Kirkendoll. For questions about ITOC-specific procedures, contact ITOC Manager Mike Dimaano or ITOC Technical Lead Marty Dart.


DISCLAIMER: IT Community News is accurate on the publication date. We do not update information in past news items. We do make every effort to keep our webpages up-to-date.