Incident Report: 20200511
Incident Report | |
---|---|
Switch reboot took down DNS | |
Summary | |
Severity | Moderate |
Impact | High (Dead air for around 44 minutes) |
Event Start | 25/02/2017 16:29 |
Event End | 25/02/2017 17:30 |
Recurrence Mitigation | Reduce dependency on uplink |
Contacts | |
Recovery Leader | Connor Sanders (CS) |
Other Attendees | Isaac Lowe (IL) |
(Total dead air: 05:02:02-05:46)
Summary
At 05:00:00, our IT Services uplink switch, urysw4, rebooted for a regularly scheduled update (that nobody was aware of because we weren't on the ITS comms list).
AutoSwitcher had dutifully started the news at 04:59:45, and was due to finish it at 05:02:00. It then tried to switch back from WebStudio (the news is layered over a silent WebStudio source... don't ask) to Jukebox. Since our uplink was down, it couldn't reach the campus DNS servers, so it couldn't resolve selector.york.ac.uk, and so it couldn't switch back.
That left us with an empty WebStudio source broadcasting. Liquidsoap detected the silence (and sent a rather beautiful "Source 0 was on air" silence email), but couldn't switch back to Jukebox for the same reason. The switch finished rebooting at 05:04:30, but by then we were stuck on dead air. Dearie-Me, for some inexplicable reason, didn't fire until 05:10 (presumably static was keeping it from hitting the threshold). CS woke up and saw the alerts at 05:37, switched back to Jukebox at 05:46, and then investigated with the assistance of IL.
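For reference, the gaps between the timestamps above can be worked out with a few lines of Python. This is only an arithmetic sketch: the event names are made up for illustration, and times given to the minute in this report (05:10, 05:37, 05:46) are assumed to fall on :00 seconds.

```python
from datetime import datetime, timedelta


def at(clock: str) -> datetime:
    """Parse an HH:MM:SS clock time (the date part is irrelevant here)."""
    return datetime.strptime(clock, "%H:%M:%S")


# Timestamps taken from this report; the key names are illustrative only.
events = {
    "switch_reboot_start": at("05:00:00"),
    "dead_air_start": at("05:02:02"),
    "switch_back_up": at("05:04:30"),
    "dearie_me_alert": at("05:10:00"),   # assumed :00 seconds
    "alerts_seen": at("05:37:00"),       # assumed :00 seconds
    "jukebox_restored": at("05:46:00"),  # assumed :00 seconds
}


def gap(start: str, end: str) -> timedelta:
    """Elapsed time between two named events."""
    return events[end] - events[start]


print("Uplink outage:          ", gap("switch_reboot_start", "switch_back_up"))
print("Total dead air:         ", gap("dead_air_start", "jukebox_restored"))
print("Dead air until alert:   ", gap("dead_air_start", "dearie_me_alert"))
print("Alert until seen:       ", gap("dearie_me_alert", "alerts_seen"))
print("Dead air after uplink up:", gap("switch_back_up", "jukebox_restored"))
```

Under those assumptions the uplink itself was only down for 4 minutes 30 seconds, while the dead air lasted 43 minutes 58 seconds, most of it after the switch had already come back.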
Recurrence mitigation
- Reduce dependency on upstream services
- Investigate a local caching DNS resolver?
- Ask ITS kindly to tell us when they take down our campus uplink
- Ask ITS kindly to make it reboot at xx:30 instead of xx:00
- Improve documentation and logging of the new WebStudio services, to make future troubleshooting easier
- Reduce log spamminess of Dearie-Me, which filled up its journald buffer quite quickly
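
As one possible way of reducing the dependency on campus DNS, a client could remember the last address a hostname resolved to and reuse it when a lookup fails. The sketch below is not how AutoSwitcher or Liquidsoap currently work, and it is not a local caching DNS resolver either; it is an application-level fallback, shown here only to illustrate the idea. The `FallbackResolver` class and its parameters are hypothetical names.

```python
import socket
from typing import Callable, Dict, Optional


class FallbackResolver:
    """Resolve hostnames, falling back to the last address that worked.

    Hypothetical helper for illustration; not part of any existing service.
    """

    def __init__(
        self,
        lookup: Callable[[str], str] = socket.gethostbyname,
        static_hosts: Optional[Dict[str, str]] = None,
    ) -> None:
        self._lookup = lookup
        # Optionally seed the cache with pinned addresses, so a cold start
        # during an outage still has something to fall back on.
        self._last_good: Dict[str, str] = dict(static_hosts or {})

    def resolve(self, hostname: str) -> str:
        try:
            address = self._lookup(hostname)
        except OSError:
            # socket.gaierror (DNS failure) is a subclass of OSError.
            if hostname in self._last_good:
                return self._last_good[hostname]
            raise  # nothing cached: surface the failure to the caller
        self._last_good[hostname] = address
        return address
```

A caller would create one `FallbackResolver()` at startup and call `resolve("selector.york.ac.uk")` before each switch. The trade-off is that a cached address can go stale if the host moves, so this suits names that rarely change.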