Outages are not a regular occurrence, but critical technology services are bound to have shortcomings from time to time, and it's our responsibility as technologists to keep these services running as smoothly as possible.
When an incident escalates to a major incident (that is, one or more critical services experience a widespread outage, degradation, or localized impact), the University IT (UIT) major incident process is immediately activated. The process aims to quickly triage, communicate about, and restore the affected service(s), and it is a vital part of IT service management (ITSM) at Stanford.
Major Incident Process Overview
Take a high-level look at these five steps that make up the major incident process.
- Detect, record, and propose. Incidents are detected by end users, UIT internal staff, or monitoring systems. The incidents are prioritized as P1 (priority 1) or P2 (priority 2) depending on the outage scope (widespread or local impact) and criticality. P1 and P2 incidents are automatically assigned to the IT Operations Center (ITOC).
- Triage and validate. ITOC validates the major incident, and based on established criteria (see table below), the Department Operations Center (DOC) may be activated. If there is an impact on Healthcare, ITOC notifies the ITI Emergency Response (I-TIER).
- Promote and assign. Once promoted to a major incident (with the DOC activated if necessary), the bridge and communications pathways are also activated for the affected service. ITOC facilitates assigning tasks to the appropriate support groups or subject matter experts (SMEs) for investigation. ITOC also creates an outage record, posts outages on the Service Alerts portal, and sends client service alert notifications.
- Investigate and restore. Support groups and SMEs investigate the cause of the major incident and issue time-based escalation and service alert notification updates. For non-healthcare services, ITOC updates clients hourly for P1 incidents and every two hours for P2 incidents. Healthcare service notifications are sent every 30 minutes. Once the service is restored, the groups complete incident tasks, resolve the outage record and send client resolution notifications.
- Evaluate, document, review. ITOC creates a problem record in ServiceNow, and SMEs and the IT Resilience team then evaluate the cause of the major incident. The problem record remains open while the teams investigate and document the root cause and permanent remediation. The SMEs submit a change request for permanent fixes as appropriate. ITOC and SMEs document information in a knowledge article for future reference. Finally, for major incidents, ITOC coordinates and conducts an “After Action Review” for continuous improvement. As owner of the problem management process, the IT Resilience group coordinates and conducts the “Root Cause and Corrective Action Review” for all major incidents and DOC events.
Dive deeper into the roles, responsibilities, and details for each step in this UIT Major Incident Process in Practice document.
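The client update cadence from the "Investigate and restore" step can be sketched as a small lookup. This is only an illustrative sketch, not a UIT tool: the function name `update_interval` and the constants are hypothetical, and it assumes the 30-minute healthcare interval applies to both P1 and P2 incidents, since the process description does not distinguish between them.

```python
from datetime import timedelta

# Intervals taken from the "Investigate and restore" step.
# Assumption: the healthcare cadence applies regardless of priority.
HEALTHCARE_INTERVAL = timedelta(minutes=30)
NON_HEALTHCARE_INTERVALS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=2),
}


def update_interval(priority: str, healthcare: bool = False) -> timedelta:
    """Return how often clients are updated during a major incident.

    `priority` must be "P1" or "P2"; anything else raises ValueError.
    """
    if priority not in NON_HEALTHCARE_INTERVALS:
        raise ValueError(f"unknown priority: {priority!r}")
    if healthcare:
        return HEALTHCARE_INTERVAL
    return NON_HEALTHCARE_INTERVALS[priority]


if __name__ == "__main__":
    print(update_interval("P1"))                   # non-healthcare P1
    print(update_interval("P2"))                   # non-healthcare P2
    print(update_interval("P1", healthcare=True))  # healthcare service
```

Keeping the intervals in one table means a change to the cadence touches a single place rather than every notification path.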
Major Incident vs. DOC
As the major incident owner, ITOC determines when to activate the DOC. This occurs when there is widespread impact to a service that supports multiple critical services. The following table shows which response is activated for P1 incidents of increasing scale.
P1 Incident Activation Criteria | Major Incident | DOC |
---|---|---|
Loss of redundancy for a single critical service offering | No | No |
Service degradation of a single critical service offering | Yes | No |
Service outage of a single critical service offering | Yes | No |
Service outage of a single service offering that supports multiple critical service offerings | Yes | Yes |
Service outage of multiple critical service offerings | Yes | Yes |
Major or critical information security incident | Yes | Yes |
Utility (electricity, chilled water) degradation or outage in a technical facility (Forsythe, ECHs) | Yes | Yes |
Utility (electricity, chilled water) degradation or outage affecting multiple buildings on campus | Yes | Yes |
Note: The UIT DOC plan is currently in the process of being reviewed and updated.
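The activation table above can also be read as a simple lookup from criterion to response. The sketch below encodes the table rows as published. The enum member names, the `Activation` tuple, and the `activation_for` function are hypothetical labels chosen for illustration, and the mapping would need to change if the DOC plan review changes the criteria.

```python
from enum import Enum, auto
from typing import NamedTuple


class Criterion(Enum):
    """P1 activation criteria; member names are illustrative labels."""
    LOSS_OF_REDUNDANCY_SINGLE_CRITICAL = auto()
    DEGRADATION_SINGLE_CRITICAL = auto()
    OUTAGE_SINGLE_CRITICAL = auto()
    OUTAGE_SINGLE_SUPPORTING_MULTIPLE_CRITICAL = auto()
    OUTAGE_MULTIPLE_CRITICAL = auto()
    MAJOR_SECURITY_INCIDENT = auto()
    UTILITY_ISSUE_TECHNICAL_FACILITY = auto()
    UTILITY_ISSUE_MULTIPLE_BUILDINGS = auto()


class Activation(NamedTuple):
    major_incident: bool
    doc: bool


# One entry per row of the activation table, in the same order.
P1_ACTIVATION = {
    Criterion.LOSS_OF_REDUNDANCY_SINGLE_CRITICAL: Activation(False, False),
    Criterion.DEGRADATION_SINGLE_CRITICAL: Activation(True, False),
    Criterion.OUTAGE_SINGLE_CRITICAL: Activation(True, False),
    Criterion.OUTAGE_SINGLE_SUPPORTING_MULTIPLE_CRITICAL: Activation(True, True),
    Criterion.OUTAGE_MULTIPLE_CRITICAL: Activation(True, True),
    Criterion.MAJOR_SECURITY_INCIDENT: Activation(True, True),
    Criterion.UTILITY_ISSUE_TECHNICAL_FACILITY: Activation(True, True),
    Criterion.UTILITY_ISSUE_MULTIPLE_BUILDINGS: Activation(True, True),
}


def activation_for(criterion: Criterion) -> Activation:
    """Look up whether a major incident and the DOC are activated."""
    return P1_ACTIVATION[criterion]


if __name__ == "__main__":
    for criterion in Criterion:
        result = activation_for(criterion)
        print(f"{criterion.name}: major_incident={result.major_incident}, doc={result.doc}")
```

One property visible in the table and preserved here is that the DOC is never activated without a major incident also being declared.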
Questions about the process
For questions about the major incident process, reach out by email or Slack to UIT’s Service Monitoring Systems Lead and Major Incident Process Owner Marvin Kirkendoll. For questions about ITOC-specific procedures, contact ITOC Manager Mike Dimaano or ITOC Technical Lead Marty Dart.
Learn more
- Looking for a diagram of the Major Incident Workflow? We got you.
- Find current service alerts and planned outage notifications on uitalerts.stanford.edu.
- Visit this guide page to access information and resources about the major incident process and other ITSM processes at Stanford.