Incident Report: 20200511

Switch reboot took down DNS
Summary
Severity: Moderate
Impact: High (dead air for around 45 minutes)
Event Start: 2020-05-11 05:00
Event End: 2020-05-11 05:46
Recurrence Mitigation: Reduce dependency on uplink
Contacts
Recovery Leader: Connor Sanders (CS)
Other Attendees: Isaac Lowe (IL)


(Total dead air: 05:02:02-05:46)

Summary

At 05:00:00, our IT Services uplink switch, urysw4, rebooted for a regularly scheduled update (that nobody was aware of because we weren't on the ITS comms list).

AutoSwitcher had started dutifully doing the news at 04:59:45, and was preparing to finish doing the news at 05:02:00. It tried to switch back from WebStudio (the news is layered over a silent WebStudio source... don't ask) to Jukebox, but it found that, since our uplink was down, it couldn't reach the campus DNS servers, thus couldn't resolve selector.york.ac.uk, and thus couldn't switch back.

That left us with an empty WebStudio source broadcasting. Liquidsoap detected the silence (and sent a rather beautiful "Source 0 was on air" silence email), but couldn't switch back to Jukebox for the same reason. The switch finished rebooting at 05:04:30, but we were stuck on dead air. Dearie-Me, for some inexplicable reason, didn't fire until 05:10 (presumably static was keeping it from hitting the threshold). CS woke up and saw the alerts at 05:37, then switched back to Jukebox at 05:46 and investigated with the assistance of IL.
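To illustrate the guess about static, here is a minimal sketch of a level-based silence check. It is not Dearie-Me's actual code: the function names and the -60 dBFS threshold are assumptions made up for this example. It only shows how a quiet noise floor can sit above a silence threshold that true digital silence would fall under.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (range -1.0..1.0) in dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    # 10*log10 of the mean square equals 20*log10 of the RMS amplitude.
    return 10 * math.log10(mean_square)


def is_silent(samples, threshold_dbfs=-60.0):
    """Hypothetical check: 'silent' means the RMS level is below the threshold."""
    return rms_dbfs(samples) < threshold_dbfs


# True digital silence is detected...
digital_silence = [0.0] * 1000
# ...but faint static (amplitude 0.002, roughly -54 dBFS) is not.
faint_static = [0.002 if i % 2 == 0 else -0.002 for i in range(1000)]

print(is_silent(digital_silence))
print(is_silent(faint_static))
```

If something like this was going on, either a less strict threshold or a "no programme audio for N seconds" check would be worth considering, but that depends on how Dearie-Me actually measures silence.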

Recurrence mitigation

  • Reduce dependency on upstream services
  • Investigate a local caching DNS resolver?
  • Ask ITS kindly to tell us when they take down our campus uplink
  • Ask ITS kindly to make it reboot at xx:30 instead of xx:00
  • Improve documentation and logging of the new WebStudio services, to make future troubleshooting easier
  • Reduce the log spamminess of Dearie-Me; it filled up its journald buffer quite quickly
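As a rough illustration of the "reduce dependency on upstream services" and caching-resolver ideas above, the sketch below remembers the last address a hostname resolved to and falls back to it when resolution fails. It is not how AutoSwitcher or Liquidsoap currently work, and it is an in-process last-known-good cache rather than a proper local caching DNS resolver. The class name `CachingResolver` and its `lookup` method are hypothetical.

```python
import socket


class CachingResolver:
    """Hypothetical helper: resolve a hostname, keeping the last good answer."""

    def __init__(self, resolve=socket.gethostbyname):
        # The resolver function is injectable so it can be swapped out in tests.
        self._resolve = resolve
        self._cache = {}

    def lookup(self, hostname):
        try:
            address = self._resolve(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError. If DNS is unreachable,
            # fall back to the last address that worked, if there is one.
            if hostname in self._cache:
                return self._cache[hostname]
            raise
        self._cache[hostname] = address
        return address
```

A service using something like this would need at least one successful lookup before the uplink drops, and a stale address is only safe while the target keeps the same IP. A local caching resolver, or a static hosts entry for the selector, would cover more cases with less custom code.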