Why we created an incident process and why you need one too

In this blog, Gwen takes us through what you need to consider when writing an incident process

Published: 20 Jul 2022

Gwen Diagram

Incidents suck. They interrupt your day, they stress people out and, although it’s great to get them fixed, the fixing itself isn’t very satisfying. Incidents at Glean are rare, and on the occasions we did have one, a few engineers simply jumped on the problem and fixed it.

This worked until we had a large incident that took a few days to fix. When this incident happened, we noticed a few antipatterns:

Excitement from the Engineers. Because incidents were rare, EVERYONE wanted to get involved in the incident.
Lack of process. As there wasn’t a process already in place, we had to make one up on the spot. This required someone to take the lead, suggest what would happen and get buy-in from the Engineers, which wasted time we could have spent focusing on the incident.
Lack of guidance around Engineer well-being. Incidents can involve people focusing intently on an issue for hours, or pairing for hours. Without someone looking out for the Engineers, they may get burnt out.

Once we resolved the incident, we set about creating a lightweight incident process, so that when an incident happens we can focus on the incident itself instead of creating a process on the fly. If you are looking to write a lightweight incident process of your own, you may be able to take some tips from how we manage incidents. We’ve found the process easy to follow and useful not only for the Engineers, but for the rest of the business as well.

What to consider when writing an incident process

When to call an incident

Incidents are wasteful processes. When an incident is called, multiple engineers will have to drop what they are doing and context switch to something new. So, it's important to make sure that calling an incident is the right call. We call an incident when the following occurs:

Service level is impacted (including degraded)
A reasonable number of customers are impacted

You’ll notice we haven’t set a percentage of customers impacted or a level of service impact. Calling an incident isn’t an exact science. Incidents will (hopefully) always be different, so you will need to use experience to ensure that starting an incident is the right call.

Who leads the incident?

We’ve assigned Incident Managers who can be contacted in case of an incident. They make the call on whether an incident should be raised, and they administer the incident. The Incident Manager’s responsibilities are as follows:

Creating the incident Slack Channel in the format #inc-{problem}-YYYYMMDD
Communication with stakeholders
Keeping track of tasks
Running War Rooms
Looking after the health of the core team - checking whether people need breaks, reminding people to eat and so on. If the team is working out of hours after 19:00 and it looks like the work will go on for longer, this includes organising food
Organising a post-incident retrospective if valuable
Updating the incident tracker spreadsheet with brief incident details

Who is involved?

At Glean, we have a “Core incident team” made up of our Architects, Engineering Effectiveness Engineers and Tech Leads. Its purpose is to stem the excitement when an incident is called and everyone wants to join. Instead, the core incident team calls in the team members who can help with the problem.

How the incident team communicates

At Glean, we hold war rooms every two hours when there is an incident. In the first war room, the team identifies:

Who is impacted by the incident
What the priority of the incident is
What part of the system is impacted by the incident
Who will need to be involved to solve the incident
What the timeline of the incident is
Whether continuous deployment should continue. If it should be turned off, post that it is off in the Engineering Slack channel
Whether engineers should continue merging into master. If not, post in the Engineering Slack channel

The war room will produce the following artefacts:

A task board with assigned tasks. The Incident Manager will identify who would like to work on what tasks and ask if they would like to pair
A public Slack channel. The channel name follows the format #inc-feature-date - for example, #inc-deltas-190921. The date is the start of the War Room, not the start of the incident (which could have been a long time ago)
For any non-routine changes we plan to make to the system, a brief summary of contingency plans to revert the change or fix-forward (to ensure we’ve given them a bit of thought)
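
As a rough illustration of the channel naming convention, here is a minimal Python sketch. The function name incident_channel_name and the slug clean-up are assumptions for illustration, not part of our tooling. The default date format is the YYYYMMDD form listed in the Incident Manager's responsibilities, and it is left as a parameter because the example above uses a six-digit date.

```python
import re
from datetime import date


def incident_channel_name(problem: str, war_room_start: date,
                          date_format: str = "%Y%m%d") -> str:
    """Build a channel name of the form #inc-{problem}-{date}.

    Hypothetical helper: the date passed in should be the start of the
    first War Room, not the start of the incident.
    """
    # Assumption: lowercase the problem name and collapse anything that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", problem.lower()).strip("-")
    if not slug:
        raise ValueError("problem must contain at least one letter or digit")
    return f"#inc-{slug}-{war_room_start.strftime(date_format)}"


if __name__ == "__main__":
    print(incident_channel_name("Deltas", date(2021, 9, 19)))
```

Generating the name from the War Room start date keeps channel names consistent no matter who creates the channel.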

Once the war room has completed, the Incident Manager will:

Set another war room for two hours’ time, only in working hours unless it is agreed that out of hours work is appropriate. War rooms will continue every two hours until the incident has been resolved or the incident severity/priority has been downgraded
Send an update in the Engineering Announcement channel that an incident is in progress
Send a high-level update to the C level in Engineering that an incident is occurring. This will be approved by members of the incident channel before being sent
Keep an eye on master to ensure it is not broken. If master is red, the Incident Manager will work to get master fixed (usually by finding someone to fix it)
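
The scheduling rule for the next war room can be sketched in a few lines of Python. This is only an illustration: the working hours (09:00 to 18:00, Monday to Friday) and the function name next_war_room are assumptions, as this post does not define working hours.

```python
from datetime import datetime, time, timedelta

# Assumed working hours; adjust to match your own team.
WORK_START = time(9, 0)
WORK_END = time(18, 0)
INTERVAL = timedelta(hours=2)


def next_war_room(previous: datetime, out_of_hours_agreed: bool = False) -> datetime:
    """Return the start of the next war room, two hours after the previous one.

    Unless out of hours work has been agreed, a slot that falls outside
    working hours rolls forward to the start of the next working day.
    """
    candidate = previous + INTERVAL
    if out_of_hours_agreed:
        return candidate
    if candidate.time() >= WORK_END:
        # Too late in the day: move to the following morning.
        candidate = datetime.combine(candidate.date() + timedelta(days=1), WORK_START)
    elif candidate.time() < WORK_START:
        # Too early in the day: wait until the working day starts.
        candidate = datetime.combine(candidate.date(), WORK_START)
    # Skip Saturday (5) and Sunday (6).
    while candidate.weekday() >= 5:
        candidate = datetime.combine(candidate.date() + timedelta(days=1), WORK_START)
    return candidate


if __name__ == "__main__":
    print(next_war_room(datetime(2022, 7, 22, 17, 0)))
```

Anyone can still call an earlier meeting, so a helper like this would only suggest the latest time the next war room should start.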

If a data or security breach has occurred:

Also raise the incident in the Infosec channel and tag the Compliance team

Rules for the War Room are:

Anyone can call an earlier meeting
Meetings are every 2 hours unless new information is uncovered
Gentlemen, you can’t fight in here, this is the War Room!

As well as the war room, we have the Slack channel for communication between war rooms. The Slack channel is always an open channel, so anyone from around the business can jump in and see what is happening. The incident channel has its own rules:

Important posts are pinned to the channel (timeline, root causes, userIds etc)
If a task is completed, tag the Incident Manager in Slack so they can keep track of the tasks
Before any communication with stakeholders, the communication will always be posted in Slack and at least one 👍 is needed
PII is never shared in the channel (email addresses, usernames etc)

How we communicate with the rest of the business

For all communication, state the number of people affected, the service impact and the priority. The baseline for incident communication is:

Alerting the support team via Slack
Escalating to the C level in Engineering. Communicating with the C level in Engineering at least daily
Ensuring that the Leadership team have been made aware

We’ve found setting up a process for incidents incredibly useful. We only have incidents every few months, but we learned from the first incident that it’s far easier to have a process in place so time isn’t wasted.

Written by Gwen Diagram
