Calm and focus in a crisis

@loganmeetsworld asked this on Twitter, and @karminker prodded me to share my thoughts. Alas, I had too many for tweets, so here are my ramblings.

I'm coming at this with a few things in mind:

  • To have focus, you must have calm.
  • The kind of focus you need will be different before, during, and after an incident.
  • Problem solving during an incident can be taught.
  • As with all things, there are first-timers and old-timers. Their needs may be different.

Calm

People are terrible problem solvers when they're stressed out. Even the best of us will become defensive, disorganised, or argumentative if we're pressured and feel we're backed into a corner. In the day to day operation of a business there is no situation quite like an incident, with its varied stressors and strains, for bringing these behaviours out in people.

Before you try to solve a problem, first make sure there is calm. There are a number of ways an Incident Manager can achieve this.

Prepare and train

Everyone should be drilled in how your company manages incidents. When an incident happens, it should be second nature to know what to do, how to behave, and what to expect. There's a reason that soldiers train, train, train on the same things repeatedly. Familiarity breeds skill and experience.

That means a few things. Everyone should:

  • Know what an incident is and how you prioritise them.
  • Know how they will be made aware of a problem, or how they can find out about one.
  • Know who is in control of the situation and who is available to them.
  • Know what kind of behaviour is expected of them, and what kind of behaviour is unacceptable.
  • Know what the rule of thumb is on fixing forward or backing out changes.

If you're reading that list and you're not sure that your teams know these things, then you're not prepared. 

Clear communication, shared understanding

The best Incident Managers I've ever met were amazing facilitators and acutely aware of the need for regular communication, both internally and with customers. They will:

  • Sound the alarm.
  • Get the right people around a table or on a call.
  • Regularly pause the conversation to summarise the situation, decisions, actions, and understanding.
  • Regularly repeat the goal of the current discussion to keep people on track. They'll shut down 'rabbit hole' discussions, or move them to their own space.
  • Set a comms cadence, and stick to it. This applies both to internal and external stakeholders.
  • Facilitate discussion and agreement. They should only be the deciding vote in a situation where a disagreement stalls progress.

Optimise your tools

Having a clear understanding of what constitutes an incident is important. Once you have that, you can optimise your tools to alert you to these things automatically, drastically improving your response times and levels of awareness. If your alerting solution, backlog management, code management, customer support, and sales systems aren't talking to each other, you're missing out.

For example, the following is a real scenario that played out at a previous employer. To this day I believe it is an amazing example of tools optimised for incident management.

  • There was a sharp increase in traffic to one cluster of servers. This surpassed our definition of 'OK' and moved into 'DDoS' territory.
  • An incident ticket was automatically created in our system for the issue, with logs, stats, and links to the last 24 hours of code pushes attached.
  • A Slack channel was automatically spun up, and all the nominated incident response contacts were added. A link to the incident was added to the channel.
  • SMS messages were sent to the same key contacts.
  • Service status was automatically changed to amber on our status page, with a message saying that we were aware of possible performance issues and were looking into it.
  • Customer Service and Sales platforms were updated with messages to flag the issue, with links to the incident ticket and Slack channel. Key accounts that use that cluster were also flagged.

In the space of a few moments, the whole company was made aware of the issue to an appropriate level of detail. Basic data gathering was done and shared. All of this happened without a single person touching a key on their keyboard. People were free to just take a seat and start the important work of working through the problem.
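
To make the shape of that automation concrete, here is a minimal sketch in Python. It is not the system described above: the threshold value, the function names, and the Incident record are all assumptions made for illustration, and each step is a stub that only records what a real integration (ticketing, chat, SMS, status page, CRM) would do.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumed threshold, in requests per second, above which traffic stops being 'OK'.
DDOS_THRESHOLD_RPS = 50_000


@dataclass
class Incident:
    cluster: str
    requests_per_second: int
    log: List[str] = field(default_factory=list)  # record of automated actions


# Each function below is a hypothetical stub standing in for a real integration.
def open_ticket(incident: Incident) -> None:
    incident.log.append("ticket opened with logs, stats, and last 24h of code pushes")


def create_chat_channel(incident: Incident) -> None:
    incident.log.append("chat channel created, responders added, ticket linked")


def send_sms(incident: Incident) -> None:
    incident.log.append("SMS sent to key contacts")


def set_status_amber(incident: Incident) -> None:
    incident.log.append("status page set to amber")


def notify_customer_teams(incident: Incident) -> None:
    incident.log.append("Customer Service and Sales platforms updated, key accounts flagged")


# The order mirrors the scenario in the list above.
RESPONSE_STEPS: List[Callable[[Incident], None]] = [
    open_ticket,
    create_chat_channel,
    send_sms,
    set_status_amber,
    notify_customer_teams,
]


def handle_traffic_sample(cluster: str, requests_per_second: int) -> Optional[Incident]:
    """Run every response step if traffic exceeds the threshold, else return None."""
    if requests_per_second <= DDOS_THRESHOLD_RPS:
        return None
    incident = Incident(cluster, requests_per_second)
    for step in RESPONSE_STEPS:
        step(incident)
    return incident
```

Calling handle_traffic_sample("cluster-a", 60_000) returns an Incident whose log lists all five actions, while a sample below the threshold returns None. The point of the design is that one agreed definition of an incident triggers every downstream system, so nobody has to remember the steps under pressure.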

Build trust

An often neglected part of being an effective Incident Manager is the ability to inspire trust and confidence in people you, most likely, only ever see when something goes wrong. When you gather people to solve a problem, they need to know that you can guide them, understand them, and keep them moving.


Focus

With all that rambling about creating calmness out of the way, let's answer @loganmeetsworld's real question. In my mind, it comes down to this: "how do we solve a problem when under pressure?"

First up, let's acknowledge that this is hard. All of our instincts make it difficult to logically work through a problem when we're stressed. Fight or flight. All of that stuff. 

For me, the key is to define together, up front, before an incident ever occurs, how you solve problems as a team. If you can agree that, and the whole team is drilled to work in that way, working logically on a problem becomes second nature.

For example, my personal preference for an incident management discussion is:

  • Summarise why we are talking.
  • Outline five things we need to achieve in the first discussion:
    • An understanding of the impact (eg. 100 customers unable to use the product).
    • An understanding of any changes to Production code that took place (eg. the last 24 hours of pull requests). Are they relevant? Require proof.
    • An understanding of external events that took place (eg. a flood near the data centre). Are they relevant? Require proof.
    • What will our comms cadence be, and are we ready to share anything yet? (eg. internally every 30 mins, externally every hour)
    • What actions are we taking away, who will own each, and when do we regroup? What progress was made on previous actions?
  • Work through each of these points methodically. After each, summarise to the group. If someone joins the discussion part way through, pause to bring them up to speed.
  • Add notes to your ticket to summarise all of this, every time.
  • Keep repeating this formula each time you meet until you land on the cause. Then, switch to working through the actions to solve it. Be open to having parallel fix activities running.

Don't skip over the 'require proof' parts of this. Two words, big impact. It's OK to need proof, even for 'obvious' things.
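
The comms cadence in the list above is easy to let slip mid-incident, so it can help to have something track it for you. The following is a small Python sketch that uses the example intervals from the list (internal every 30 minutes, external every hour). The names CADENCE and overdue_audiences are hypothetical, and a real setup would hook this into whatever chat or paging tool you already use.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Example cadence from the list above; adjust to whatever your team agrees.
CADENCE: Dict[str, timedelta] = {
    "internal": timedelta(minutes=30),
    "external": timedelta(hours=1),
}


def overdue_audiences(last_updates: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the audiences whose next update is due at or before `now`."""
    return [
        audience
        for audience, last_sent in last_updates.items()
        if now >= last_sent + CADENCE[audience]
    ]
```

For instance, if both audiences were last updated at 09:00, checking at 09:45 reports only "internal" as due, and checking at 10:00 reports both.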

I've also found the following to be very effective:

  • Reiterate the house rules for incident management and problem solving. Remind people of how we work together in these situations.
  • Enforce pairing. No person works alone. This means that people have the benefit of a sounding board and an ally whilst they figure something out, reducing the risk of taking an unproductive path.
  • Use positive framing. Things may not be great, but your choice of words will have a serious impact on the morale and thinking of the team.
  • Run retrospectives after every incident, and carve out time as part of that to understand where your ways of working were positively or negatively perceived. Iterate on the process!

So, there you have it. My 7am thoughts on the subject of incident management and problem solving.