Incident Report: 20200304: Difference between revisions
| No edit summary | |||
| (45 intermediate revisions by the same user not shown) | |||
| Line 2: | Line 2: | ||
|    |brief=Dante is hard, mmkay? |    |brief=Dante is hard, mmkay? | ||
|    |severity=High |    |severity=High | ||
|    |impact=High (Approx 15mins of dead air  |    |impact=High (Approx 15mins of dead air, one ruined live show, also server reboot) | ||
|    |start=04/03/2020 21:17 |    |start=04/03/2020 21:17 | ||
|    |end=04/03/2020 21:58 |    |end=04/03/2020 21:58 | ||
|    |mitigation=See body |    |mitigation=See body | ||
|    |leader=Marks Polakovs (MP) |    |leader=Marks Polakovs (MP) | ||
|    |others=Matthew Stratford (MS), Isaac Lowe (IL), Michael Grace (MG), Jacob Dicker (JD) |    |others=Matthew Stratford (MS), Isaac Lowe (IL), Michael Grace (MG), Jacob Dicker (JD), Jess Schofield (JS), Ben Allen (BA) | ||
| }} | }} | ||
| ''' Note: report is still a work-in-progress, expect changes. Contributions are welcome, please inform MP. ''' | |||
| = Summary = | = Summary = | ||
| Line 17: | Line 19: | ||
| * Given the scale of the hardware issues experienced, the fact that we had a total of 15 minutes of dead air is not ideal but could be a lot worse. | * Given the scale of the hardware issues experienced, the fact that we had a total of 15 minutes of dead air is not ideal but could be a lot worse. | ||
| :* However, a lot of that was off-air-loop, which, while ''technically'' not dead air, isn't exactly  | :* However, a lot of that was off-air-loop, sine-wave, or Flagship News Sosij, which, while ''technically'' not dead air, isn't exactly quality broadcast output. TODO exact numbers, but somewhere around 10m of that. | ||
| == What Did Not Go So Well == | == What Did Not Go So Well == | ||
| Line 23: | Line 25: | ||
| * Took us a while to realise that the RedNet had failed (especially since the error is layer 7 it wouldn't trigger monitoring as that does layer 4) | * Took us a while to realise that the RedNet had failed (especially since the error is layer 7 it wouldn't trigger monitoring as that does layer 4) | ||
| * Switching clock input to ADAT was a mistake and delayed recovery quite substantially | * Switching clock input to ADAT was a mistake and delayed recovery quite substantially | ||
| * Little insight into what happened with the Scarlett, we saw some Jack errors, diagnosed probably hardware/driver error, and made call to reboot system in order to get back on air | |||
| :* journald is not set to persistent, so the only post-hoc logs we have are from after the reboot | |||
| * TX OB was delayed | |||
| == How We Got Lucky == | == How We Got Lucky == | ||
| * Almost all current experienced computing team members were in the station and ready to respond quickly | * Almost all current experienced computing team members were in the station and ready to respond quickly | ||
| * Clocking resolved itself, sooner or later, as we have no idea what actually caused it | |||
| = Analysis = | = Analysis = | ||
| The most likely culprit in this scenario is some kind of hardware issue, potentially electrical fault in the RedNet 3. This would explain why it was cycling between link DOWN and UP on the network - it's possible that shifting logic levels inside confused the ethernet controller and/or sent invalid Layer 1 signals. It's possible that during the clocking storm (21:29-21:32) the RedNet was disappearing from the network, triggering a Dante clock master election, then re-appearing, establishing itself as master, before leaving again. (This does not explain, however, why the master would cycle between Phil, Office, and StudioRed, as one would hope Dante can at least pick one as master and stick with it.) This would also explain the hot cable - normally a word clock signal is a mere 5V, not even close to sufficient for heating it up. The noise heard during the clocking storm also sounds digital, almost like dropping packets. Anecdotally, we've had issues with RedNet power supplies in the past. | |||
| = Actions = | = Actions = | ||
| * MP to remove song that caused Campus Playout bug and high Dolby CPU usage from rotation - '''Done''' | |||
| * MP to set up emergency back-up audio player on BSOD - '''Done''' | |||
| * MS to finish hardware dead-air detector - '''Done''' | |||
| * JS and BA to contact Focusrite Support about RedNet issues - '''Done''' | |||
| * BA to replace clock coax with high-quality SDI cable | |||
| * MP to set up Journald persistence on Dolby (and ideally all Debian boxen) - '''done''' on Dolby, on hold on rest | |||
| * Purchase alternative Dante<->ADAT interface - '''Done''' | |||
| * MP and MS to sort out SelectorListener startup - on hold until off-air | |||
| = Timeline = | = Timeline = | ||
| [https://docs.google.com/spreadsheets/d/1P61vGuJai7zTar07iOwK8H1sNvb5J2WGZ78o6hJRBiA/edit#gid=0 All logs] | |||
| 19:00-21:00 MS and MP investigating CPU issues on Dolby. | 19:00-21:00 MS and MP investigating CPU issues on Dolby. | ||
| Line 38: | Line 57: | ||
| 21:00-21:10 sporadic bits of dead air on AM & online (some presenter error, some potentially Dante related) | 21:00-21:10 sporadic bits of dead air on AM & online (some presenter error, some potentially Dante related) | ||
| 21:17:00 MG reports “big issue” in Slack | 21:17:00 '''MG reports “big issue” in Slack''' | ||
| 21:18:35 Dante Controller reports “Device 001DC10208C1 (device name not known) is now a grandmaster” - the RedNet 3 has dropped off | 21:18:35 Dante Controller reports “Device 001DC10208C1 (device name not known) is now a grandmaster” - the RedNet 3 has dropped off | ||
| : In retrospect, I (MP) don't think this was the entry that confirmed R3 dropoff, as the timestamp was just when Dante Controller was started. However, it's hard to pin down the exact time, as the switch logs show the R3 flapping on and off the network repeatedly. | |||
| 21:18:36 MP screams “DANTE!” | 21:18:36 MP screams “DANTE!” | ||
| 21:19-21:20 MP turns the RedNet 3 off and on again | 21:19-21:20 '''MP turns the RedNet 3 off and on again''' | ||
| 21:20 Dante starts flapping clock, but seemingly stabilises | 21:20 Dante starts flapping clock, but seemingly stabilises | ||
| Line 50: | Line 71: | ||
| 21:28 The AM feed becomes noticeably sosig, slowly degrades quality | 21:28 The AM feed becomes noticeably sosig, slowly degrades quality | ||
| 21:29-21:32 The AM goes to  | 21:29-21:32 The AM goes to '''DEAD AIR''' with occasional bursts of techno. At the same time, Dante Controller reports '''various clock switches and devices muting''' (which is Dante-speak for “oman i am no good with audio pls to halp”). The clock master cycles between Phil, Office, and Studio Red several times. MP says in Slack at 21:31 “[Dante] is not in a good place right now.” | ||
| : Right as the clocking storm starts, the switch starts reporting <code>IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is DOWN.</code> - port 11 is uryStores. Later <code>IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is UP.</code> and then DOWN again, ad infinitum. | |||
| Relevant log entry: 2020-03-04 21:30:53 GMT	1583357453638	Information	"uryStores"	"Timed out 3 times sending message 'UpdateRxChannels' to uryStores, giving up." | : Relevant log entry: <code>2020-03-04 21:30:53 GMT	1583357453638	Information	"uryStores"	"Timed out 3 times sending message 'UpdateRxChannels' to uryStores, giving up."</code> | ||
| 21:33:34 MP, MS, BA set Studio Blue to clock master. Dante sort-of stabilises except not really | 21:33:34 MP, MS, BA set Studio Blue to clock master. Dante sort-of stabilises except not really | ||
| 21:36:21 IL reports that the clock coax between uryStores and Scarlett is hot to the touch. | 21:36:21 '''IL reports that the clock coax between uryStores and Scarlett is hot to the touch.''' | ||
| : "Boiling" was the word he used during the debrief - "not enough to burn me, but certainly enough to be concerning". | |||
| 21:38:25 '''we go to DEAD AIR''', and at 21:39:14 Dearie-Me triggers. | |||
| Around 21:37-39 MP uses RedNet Control to switch the RedNet 3’s clock input to ADAT, believing this will help things. This does not make the Scarlett happy. | |||
| : JD speculates that this was because, when we switched clocking to ADAT, the RedNet was trying to feed the Scarlett clock via ADAT at the same time that the Scarlett was trying to feed the RedNet clock via BNC (the audio equivalent of two DHCP servers on one network). | |||
| Around this time IL attempts switching out the coax, but fails when he realises that the replacement cable is a piece of shit (frayed ends), and, after noticing we'd gone to dead air, switched it back. | |||
| : During the debrief IL reported that, looking at the front panel of the Scarlett, no lights were on - not even the AM return feed, which should have at least some signal (modulation noise) even in the event of dead air | |||
| During this time IL and BA are setting up a TX OB (in layman's terms, shove a microphone directly into the transmitter to get *some* signal on air). Much running between office and Stores to gather equipment ensues. | |||
| 21:38:49 Dante Controller reports that uryStores is clock master, as the team is oblivious to the ongoing Scarlett cataclysm. | |||
| 21:39:56 Studio Blue is switched to Clock Master, confirmed by MP in Slack at 21:40:07. Still dead air. | 21:39:56 Studio Blue is switched to Clock Master, confirmed by MP in Slack at 21:40:07. Still dead air. | ||
| 21:42 MP tries an audio pipeline restart (aka startAudio.sh). This does not help. | 21:42 MP tries an audio pipeline restart (aka startAudio.sh). This does not help. | ||
| : Gracefully stopping Jack via systemd fails and MP has to kill it, probably due to driver issues. | |||
| :: jack_lsp reported <code>jack_client_open() failed, status = 0x21</code> | |||
| : During the debrief BA speculated that, although it was showing up in as a device (although MP has no logs of this), the Scarlett had probably borked itself completely due to the double-clocking. | |||
| 21:44 MP makes the call to reboot Dolby. | 21:44 MP makes the call to reboot Dolby. | ||
| Line 74: | Line 113: | ||
| 21:47:04 MP SSHes into Dolby and tries starting audio, failing. | 21:47:04 MP SSHes into Dolby and tries starting audio, failing. | ||
| 21:49:41 DEAD AIR ENDS - goes to horrible sosig Flagship News (from the TX OB) | 21:49:41 '''DEAD AIR ENDS''' - goes to horrible sosig Flagship News (from the TX OB) | ||
| 21:50:17 MP runs startAudio.sh | 21:50:17 MP runs startAudio.sh | ||
| Line 82: | Line 121: | ||
| 21:52:00 horrible sosig Flagship News ends and AM has off-air loop | 21:52:00 horrible sosig Flagship News ends and AM has off-air loop | ||
| 21:54-21:56 MP realises that SelectorListener isn’t running, and runs some commands to try and start it. | 21:54-21:56 MP realises that SelectorListener isn’t running, and runs some commands to try and start it. Selector cycles between Off-Air, Sine Wave, and Jukebox as MP tests that it works. | ||
| 21:58:01 MP selects Jukebox. | |||
| 21:58:59 '''MP declares the incident over''' | |||
| [[Category:Incident Reports]] | |||
Latest revision as of 09:54, 7 July 2020
| Incident Report | |
|---|---|
| Dante is hard, mmkay? | |
| Summary | |
| Severity | High | 
| Impact | High (Approx 15mins of dead air, one ruined live show, also server reboot) | 
| Event Start | 04/03/2020 21:17 | 
| Event End | 04/03/2020 21:58 | 
| Recurrence Mitigation | See body | 
| Contacts | |
| Recovery Leader | Marks Polakovs (MP) | 
| Other Attendees | Matthew Stratford (MS), Isaac Lowe (IL), Michael Grace (MG), Jacob Dicker (JD), Jess Schofield (JS), Ben Allen (BA) | 
 Note: report is still a work-in-progress, expect changes. Contributions are welcome, please inform MP. 
Summary
While comp team were investigating CPU load issues on Dolby, the Stores RedNet 3, for seemingly unrelated reasons, drops off the network. This has been seen before. What hasn't been seen before is what happened next: after humming along perfectly happy for 9 minutes, the Dante network started re-electing clock master every 15 seconds, which lead to drop-outs. While trying to fix that, the clock input is switched from BNC to ADAT, which really upsets the Scarlett 18i20, eventually requiring a full reboot of Dolby to get back on air.
What Went Well
- Given the scale of the hardware issues experienced, the fact that we had a total of 15 minutes of dead air is not ideal but could be a lot worse.
- However, a lot of that was off-air-loop, sine-wave, or Flagship News Sosij, which, while technically not dead air, isn't exactly quality broadcast output. TODO exact numbers, but somewhere around 10m of that.
 
What Did Not Go So Well
- Took us a while to realise that the RedNet had failed (especially since the error is layer 7 it wouldn't trigger monitoring as that does layer 4)
- Switching clock input to ADAT was a mistake and delayed recovery quite substantially
- Little insight into what happened with the Scarlett, we saw some Jack errors, diagnosed probably hardware/driver error, and made call to reboot system in order to get back on air
- journald is not set to persistent, so the only post-hoc logs we have are from after the reboot
 
- TX OB was delayed
How We Got Lucky
- Almost all current experienced computing team members were in the station and ready to respond quickly
- Clocking resolved itself, sooner or later, as we have no idea what actually caused it
Analysis
The most likely culprit in this scenario is some kind of hardware issue, potentially electrical fault in the RedNet 3. This would explain why it was cycling between link DOWN and UP on the network - it's possible that shifting logic levels inside confused the ethernet controller and/or sent invalid Layer 1 signals. It's possible that during the clocking storm (21:29-21:32) the RedNet was disappearing from the network, triggering a Dante clock master election, then re-appearing, establishing itself as master, before leaving again. (This does not explain, however, why the master would cycle between Phil, Office, and StudioRed, as one would hope Dante can at least pick one as master and stick with it.) This would also explain the hot cable - normally a word clock signal is a mere 5V, not even close to sufficient for heating it up. The noise heard during the clocking storm also sounds digital, almost like dropping packets. Anecdotally, we've had issues with RedNet power supplies in the past.
Actions
- MP to remove song that caused Campus Playout bug and high Dolby CPU usage from rotation - Done
- MP to set up emergency back-up audio player on BSOD - Done
- MS to finish hardware dead-air detector - Done
- JS and BA to contact Focusrite Support about RedNet issues - Done
- BA to replace clock coax with high-quality SDI cable
- MP to set up Journald persistence on Dolby (and ideally all Debian boxen) - done on Dolby, on hold on rest
- Purchase alternative Dante<->ADAT interface - Done
- MP and MS to sort out SelectorListener startup - on hold until off-air
Timeline
19:00-21:00 MS and MP investigating CPU issues on Dolby.
21:00-21:10 sporadic bits of dead air on AM & online (some presenter error, some potentially Dante related)
21:17:00 MG reports “big issue” in Slack
21:18:35 Dante Controller reports “Device 001DC10208C1 (device name not known) is now a grandmaster” - the RedNet 3 has dropped off
- In retrospect, I (MP) don't think this was the entry that confirmed R3 dropoff, as the timestamp was just when Dante Controller was started. However, it's hard to pin down the exact time, as the switch logs show the R3 flapping on and off the network repeatedly.
21:18:36 MP screams “DANTE!”
21:19-21:20 MP turns the RedNet 3 off and on again
21:20 Dante starts flapping clock, but seemingly stabilises
21:28 The AM feed becomes noticeably sosig, slowly degrades quality
21:29-21:32 The AM goes to DEAD AIR with occasional bursts of techno. At the same time, Dante Controller reports various clock switches and devices muting (which is Dante-speak for “oman i am no good with audio pls to halp”). The clock master cycles between Phil, Office, and Studio Red several times. MP says in Slack at 21:31 “[Dante] is not in a good place right now.”
- Right as the clocking storm starts, the switch starts reporting IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is DOWN.- port 11 is uryStores. LaterIFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is UP.and then DOWN again, ad infinitum.
- Relevant log entry: 2020-03-04 21:30:53 GMT 1583357453638 Information "uryStores" "Timed out 3 times sending message 'UpdateRxChannels' to uryStores, giving up."
21:33:34 MP, MS, BA set Studio Blue to clock master. Dante sort-of stabilises except not really
21:36:21 IL reports that the clock coax between uryStores and Scarlett is hot to the touch.
- "Boiling" was the word he used during the debrief - "not enough to burn me, but certainly enough to be concerning".
21:38:25 we go to DEAD AIR, and at 21:39:14 Dearie-Me triggers.
Around 21:37-39 MP uses RedNet Control to switch the RedNet 3’s clock input to ADAT, believing this will help things. This does not make the Scarlett happy.
- JD speculates that this was because, when we switched clocking to ADAT, the RedNet was trying to feed the Scarlett clock via ADAT at the same time that the Scarlett was trying to feed the RedNet clock via BNC (the audio equivalent of two DHCP servers on one network).
Around this time IL attempts switching out the coax, but fails when he realises that the replacement cable is a piece of shit (frayed ends), and, after noticing we'd gone to dead air, switched it back.
- During the debrief IL reported that, looking at the front panel of the Scarlett, no lights were on - not even the AM return feed, which should have at least some signal (modulation noise) even in the event of dead air
During this time IL and BA are setting up a TX OB (in layman's terms, shove a microphone directly into the transmitter to get *some* signal on air). Much running between office and Stores to gather equipment ensues.
21:38:49 Dante Controller reports that uryStores is clock master, as the team is oblivious to the ongoing Scarlett cataclysm.
21:39:56 Studio Blue is switched to Clock Master, confirmed by MP in Slack at 21:40:07. Still dead air.
21:42 MP tries an audio pipeline restart (aka startAudio.sh). This does not help.
- Gracefully stopping Jack via systemd fails and MP has to kill it, probably due to driver issues.
- jack_lsp reported jack_client_open() failed, status = 0x21
 
- jack_lsp reported 
- During the debrief BA speculated that, although it was showing up in as a device (although MP has no logs of this), the Scarlett had probably borked itself completely due to the double-clocking.
21:44 MP makes the call to reboot Dolby.
21:46:42 Dolby boots to Linux.
21:47:04 MP SSHes into Dolby and tries starting audio, failing.
21:49:41 DEAD AIR ENDS - goes to horrible sosig Flagship News (from the TX OB)
21:50:17 MP runs startAudio.sh
21:50:50 uryStores becomes clock master, at 21:51:09 Dante finishes re-synchronising
21:52:00 horrible sosig Flagship News ends and AM has off-air loop
21:54-21:56 MP realises that SelectorListener isn’t running, and runs some commands to try and start it. Selector cycles between Off-Air, Sine Wave, and Jukebox as MP tests that it works.
21:58:01 MP selects Jukebox.
21:58:59 MP declares the incident over