Difference between revisions of "Incident Report: 20200304"

From URY Wiki
Jump to navigation Jump to search
(Created page with "{{Incident |brief=Dante is hard, mmkay? |severity=High |impact=High (Approx 15mins of dead air during a live show, also server reboot) |start=04/03/2020 21:17 |end=0...")
 
Line 33: Line 33:
  
 
= Timeline =
 
= Timeline =
 +
 +
19:00-21:00 MS and MP investigating CPU issues on Dolby.
 +
 +
21:00-21:10 sporadic bits of dead air on AM & online (some presenter error, some potentially Dante related)
 +
 +
21:17:00 MG reports “big issue” in Slack
 +
 +
21:18:35 Dante Controller reports “Device 001DC10208C1 (device name not known) is now a grandmaster” - the RedNet 3 has dropped off
 +
 +
21:18:36 MP screams “DANTE!”
 +
 +
21:19-21:20 MP turns the RedNet 3 off and on again
 +
 +
21:20 Dante starts flapping clock, but seemingly stabilises
 +
 +
21:28 The AM feed becomes noticeably sosig, slowly degrades quality
 +
 +
21:29-21:32 The AM goes to silence with occasional bursts of techno. At the same time, Dante Controller reports various clock switches and devices muting (which is Dante-speak for “oman i am no good with audio pls to halp”). The clock master cycles between Phil, Office, and Studio Red several times. MP says in Slack at 21:31 “[Dante] is not in a good place right now.”
 +
 +
Relevant log entry: 2020-03-04 21:30:53 GMT 1583357453638 Information "uryStores" "Timed out 3 times sending message 'UpdateRxChannels' to uryStores, giving up."
 +
 +
21:33:34 MP, MS, BA set Studio Blue to clock master. Dante sort-of stabilises except not really
 +
 +
21:36:21 IL reports that the clock coax between uryStores and Scarlett is hot to the touch.
 +
 +
Around this time (exact timing TBC) IL attempts switching out the coax, but fails when he realises that the replacement cable is a piece of shit.
 +
 +
21:38:25 we go to DEAD AIR, and at 21:39:14 Dearie-Me triggers.
 +
 +
Around 21:37-39 MP uses RedNet Control to switch the RedNet 3’s clock input to ADAT, believing this will help things, and at 21:38:49 Dante Controller reports that uryStores is clock master. This does not make the Scarlett happy.
 +
 +
21:39:56 Studio Blue is switched to Clock Master, confirmed by MP in Slack at 21:40:07. Still dead air.
 +
 +
21:42 MP tries an audio pipeline restart (aka startAudio.sh). This does not help.
 +
 +
21:44 MP makes the call to reboot Dolby.
 +
 +
21:46:42 Dolby boots to Linux.
 +
 +
21:47:04 MP SSHes into Dolby and tries starting audio, failing.
 +
 +
21:49:41 DEAD AIR ENDS - goes to horrible sosig Flagship News (from the TX OB)
 +
 +
21:50:17 MP runs startAudio.sh
 +
 +
21:50:50 uryStores becomes clock master, at 21:51:09 Dante finishes re-synchronising
 +
 +
21:52:00 horrible sosig Flagship News ends and AM has off-air loop
 +
 +
21:54-21:56 MP realises that SelectorListener isn’t running, and runs some commands to try and start it.
 +
 +
21:58:59 MP declares the incident over

Revision as of 18:28, 5 March 2020

Incident Report
Dante is hard, mmkay?
Summary
Severity High
Impact High (Approx 15mins of dead air during a live show, also server reboot)
Event Start 04/03/2020 21:17
Event End 04/03/2020 21:58
Recurrence Mitigation See body
Contacts
Recovery Leader Marks Polakovs (MP)
Other Attendees Matthew Stratford (MS), Isaac Lowe (IL), Michael Grace (MG), Jacob Dicker (JD)


Summary

While investigating CPU load issues on Dolby, the Stores RedNet 3, for seemingly unrelated reasons, drops off the network. This has been seen before. What hasn't been seen before is what happened next: after humming along perfectly happy for 9 minutes, the Dante network started re-electing clock master every 15 seconds, which lead to drop-outs. While trying to fix that, the clock input is switched from BNC to ADAT, which really upsets the Scarlett 18i20, eventually requiring a full reboot of Dolby to get back on air.

What Went Well

  • Given the scale of the hardware issues experienced, the fact that we had a total of 15 minutes of dead air is not ideal but could be a lot worse.
  • However, a lot of that was off-air-loop, which, while technically not dead air, isn't exactly broadcasting. TODO exact numbers, but somewhere around 10m of that.

What Did Not Go So Well

  • Took us a while to realise that the RedNet had failed (especially since the error is layer 7 it wouldn't trigger monitoring as that does layer 4)
  • Switching clock input to ADAT was a mistake and delayed recovery quite substantially

How We Got Lucky

  • Almost all current experienced computing team members were in the station and ready to respond quickly

Analysis

Actions

Timeline

19:00-21:00 MS and MP investigating CPU issues on Dolby.

21:00-21:10 sporadic bits of dead air on AM & online (some presenter error, some potentially Dante related)

21:17:00 MG reports “big issue” in Slack

21:18:35 Dante Controller reports “Device 001DC10208C1 (device name not known) is now a grandmaster” - the RedNet 3 has dropped off

21:18:36 MP screams “DANTE!”

21:19-21:20 MP turns the RedNet 3 off and on again

21:20 Dante starts flapping clock, but seemingly stabilises

21:28 The AM feed becomes noticeably sosig, slowly degrades quality

21:29-21:32 The AM goes to silence with occasional bursts of techno. At the same time, Dante Controller reports various clock switches and devices muting (which is Dante-speak for “oman i am no good with audio pls to halp”). The clock master cycles between Phil, Office, and Studio Red several times. MP says in Slack at 21:31 “[Dante] is not in a good place right now.”

Relevant log entry: 2020-03-04 21:30:53 GMT 1583357453638 Information "uryStores" "Timed out 3 times sending message 'UpdateRxChannels' to uryStores, giving up."

21:33:34 MP, MS, BA set Studio Blue to clock master. Dante sort-of stabilises except not really

21:36:21 IL reports that the clock coax between uryStores and Scarlett is hot to the touch.

Around this time (exact timing TBC) IL attempts switching out the coax, but fails when he realises that the replacement cable is a piece of shit.

21:38:25 we go to DEAD AIR, and at 21:39:14 Dearie-Me triggers.

Around 21:37-39 MP uses RedNet Control to switch the RedNet 3’s clock input to ADAT, believing this will help things, and at 21:38:49 Dante Controller reports that uryStores is clock master. This does not make the Scarlett happy.

21:39:56 Studio Blue is switched to Clock Master, confirmed by MP in Slack at 21:40:07. Still dead air.

21:42 MP tries an audio pipeline restart (aka startAudio.sh). This does not help.

21:44 MP makes the call to reboot Dolby.

21:46:42 Dolby boots to Linux.

21:47:04 MP SSHes into Dolby and tries starting audio, failing.

21:49:41 DEAD AIR ENDS - goes to horrible sosig Flagship News (from the TX OB)

21:50:17 MP runs startAudio.sh

21:50:50 uryStores becomes clock master, at 21:51:09 Dante finishes re-synchronising

21:52:00 horrible sosig Flagship News ends and AM has off-air loop

21:54-21:56 MP realises that SelectorListener isn’t running, and runs some commands to try and start it.

21:58:59 MP declares the incident over