Incident Report: 20200304

Revision as of 17:26, 5 March 2020 by Marks Polakovs
Incident Report
Dante is hard, mmkay?
Summary
Severity: High
Impact: High (approx. 15 mins of dead air during a live show, also server reboot)
Event Start: 04/03/2020 21:17
Event End: 04/03/2020 21:58
Recurrence Mitigation: See body
Contacts
Recovery Leader: Marks Polakovs (MP)
Other Attendees: Matthew Stratford (MS), Isaac Lowe (IL), Michael Grace (MG), Jacob Dicker (JD)


Summary

While CPU load issues on Dolby were being investigated, the Stores RedNet 3 dropped off the network for seemingly unrelated reasons. This has been seen before. What hadn't been seen before is what happened next: after humming along perfectly happily for 9 minutes, the Dante network started re-electing its clock master every 15 seconds, which led to drop-outs. During attempts to fix that, the clock input was switched from BNC to ADAT, which really upset the Scarlett 18i20 and eventually required a full reboot of Dolby to get back on air.

What Went Well

  • Given the scale of the hardware issues, a total of 15 minutes of dead air is not ideal, but it could have been a lot worse.
  • However, a lot of that was the off-air loop, which, while technically not dead air, isn't exactly broadcasting either. TODO exact numbers, but somewhere around 10 minutes of the 15.

What Did Not Go So Well

  • It took us a while to realise that the RedNet had failed (especially since the failure was at layer 7, so it didn't trigger our monitoring, which only checks layer 4)
  • Switching clock input to ADAT was a mistake and delayed recovery quite substantially
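
To illustrate the monitoring gap in the first point above, here is a minimal Python sketch of the difference between a layer 4 check (can a TCP connection be opened?) and a layer 7 check (does the service actually answer?). Everything in it is an assumption made for illustration: the PING/PONG exchange is a made-up protocol, start_fake_device is a hypothetical stand-in for a device, and none of it reflects how the RedNet, Dante, or our actual monitoring works.

```python
import socket
import threading


def layer4_check(host, port, timeout=1.0):
    """Return True if a TCP connection can be opened (layer 4 only)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def layer7_check(host, port, timeout=1.0):
    """Return True only if the service also answers a request (layer 7).

    The PING/PONG exchange is a made-up protocol, used purely for illustration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"PING\n")
            return conn.recv(16).startswith(b"PONG")
    except OSError:  # includes timeouts while waiting for a reply
        return False


def start_fake_device(responds):
    """Start a local TCP server standing in for a device (hypothetical).

    If responds is False, it accepts connections but never answers,
    which imitates a device whose application has hung.
    Returns the listening socket and the port it was bound to.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()
    held = []  # keeps silent connections open so they are never answered

    def serve():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            if not responds:
                held.append(conn)
                continue
            try:
                conn.recv(16)
                conn.sendall(b"PONG\n")
            except OSError:
                pass  # client went away before we could reply
            finally:
                conn.close()

    threading.Thread(target=serve, daemon=True).start()
    return server, server.getsockname()[1]
```

Against a fake device started with responds=False, layer4_check still reports success while layer7_check fails once its timeout expires, which is the kind of failure a layer 4 only check cannot see.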

How We Got Lucky

  • Almost all current experienced computing team members were in the station and ready to respond quickly

Analysis

Actions

Timeline