Incident Response Guide

From URY Wiki
Revision as of 20:49, 15 May 2020 by Marks Polakovs

The URY Guide to Putting Out Fires Efficiently

This guide does not apply if there is an actual fire. In that case, run.

What is an incident?

An incident is any technical problem that needs to be fixed urgently, i.e. one where we can't say "we'll fix it next week". This could be as serious as dead air, or something less severe that still can't wait.

In the event of Dearie Me or another studio fault message, or any incident involving Computing/Engineering, that can’t be solved quickly, a recovery leader must be established.

If you feel like you don’t know enough to fix the problem yourself, don’t be afraid to escalate, usually with a message in #studio-faults. If it’s a serious fault (dead air), use @channel.

If you know the solution is very quick/simple, e.g. “Help! The presenter mic in red is wobbly!”, don’t waste time declaring a recovery leader. In this case the response is “yes we know, we’ll send an engineer in with a tiny Ben Allen Key soon”.

Otherwise…

Criteria for recovery leader

  • One of the first people aware of the situation (if something has occurred at an obscure hour then probably the only person aware)
  • Has enough technical know-how to decide which of the many suggestions thrown their way is best to carry out first.
  • Must relay clear instructions to the Incident Worker/s.
  • Their job is to see the bigger picture (we need to get back on air ASAP, and then we can start to troubleshoot and fix).
  • This means that the most qualified person for fixing the problem isn’t always the best person to be recovery leader, as they may get tunnel vision while working on the problem.
  • Must be able to keep all respondents on track.

If you feel you are best placed in a given incident to be the recovery leader, please elect yourself with consideration for others, e.g. say “Does anyone mind if I take on recovery leader?”.

Unless there are any severe objections, all respondents should listen to the instructions of the recovery leader and accept that their decision is final.

Response Procedure

If you are on site

You will be able to make use of those who are already on the scene or who arrive to deal with the incident. If someone else offers to respond, you will need to decide whether their presence would be needed or constructive.

There could also be useful alumni and others communicating via Slack (start a thread for them to air their thoughts).

Delegate each task to an appropriate person who isn’t busy and is capable of doing it, e.g. “X, can you do Y please?”. Don’t ever say “can someone do Y”, because it’ll never get done. Always name a specific person.

If we are working remotely

There will only be one attendee on site.

You will need to create two threads in Engineering, Computing, or Studio Faults (whichever you deem most appropriate). One thread for all respondents to discuss their ideas/solutions and another for you to convey your final decision to the Incident Worker or those remotely accessing equipment.

If a call is needed, start a Google Meet and post the link. Please record the call.

Take the time to write instructions clearly, and avoid the use of abbreviations or unfamiliar technical terms.

Either way

Listen to what everyone is saying and decide what the best next step is. Good questions to ask: “What’s wrong?” “How could we fix it?” “What are the risks associated?” “Does anyone have any strong objections to doing X?”

Remember to see the bigger picture. As tempting as it may be, don’t get sidetracked with actually fixing the problem or discussing other issues that are less pressing.

Keep an eye on the wellbeing of your team. If someone is getting too stressed or panicky, don’t be afraid to ask them to take a break. The same goes for you. Don’t be afraid to ask someone else to become recovery leader, hand over, and leave. Don’t wear yourself out.

Once the situation is resolved, debrief all respondents with ‘What Went Well’ and ‘Even Better If’ and thank them all dearly. If further discussion is necessary, it will take place in either a Computing or Engineering meeting (whichever is closest/most relevant).

Use the call recording, Slack messages and notes to submit an incident report on the wiki under Category:Incident Reports. If you have never done one of these before, include a timeline of events including dead air, a description of actions taken and their results, and some form of debrief (use previous reports as a model).
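
If it helps to gather your notes before writing the report up, the parts listed above (timeline, actions and results, debrief) can be sketched as a small Python structure. This is only an illustration: the class name, field names and layout are assumptions made for this example, not an official URY template, and the sample entries are placeholders rather than a real incident. Previous reports remain the model to follow.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """Hypothetical skeleton for an incident report (names are illustrative)."""
    title: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (time, event)
    actions: list[tuple[str, str]] = field(default_factory=list)   # (action, result)
    went_well: list[str] = field(default_factory=list)             # debrief: What Went Well
    even_better_if: list[str] = field(default_factory=list)        # debrief: Even Better If

    def render(self) -> str:
        """Return the report as plain text, one section after another."""
        lines = [self.title, "", "Timeline"]
        lines += [f"  {time} {event}" for time, event in self.timeline]
        lines += ["", "Actions taken"]
        lines += [f"  {action}: {result}" for action, result in self.actions]
        lines += ["", "What Went Well"]
        lines += [f"  {item}" for item in self.went_well]
        lines += ["", "Even Better If"]
        lines += [f"  {item}" for item in self.even_better_if]
        return "\n".join(lines)


# Placeholder data only, to show the shape of a filled-in report.
report = IncidentReport(title="Example incident (placeholder data)")
report.timeline.append(("00:00", "Dead air begins"))
report.timeline.append(("00:10", "Back on air"))
report.actions.append(("Example action", "example result"))
report.went_well.append("Example point")
report.even_better_if.append("Example point")
print(report.render())
```

Keeping dead air as explicit timeline entries makes it easy to see how long the station was off air when the report is read back later.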