Incident Report: 20200511

From URY Wiki
Incident Report
Switch reboot took down DNS, which broke selector
Summary
Severity Moderate
Impact High (Dead air for around 45 minutes)
Event Start 2020-05-11 05:00
Event End 2020-05-11 05:46
Recurrence Mitigation Reduce dependency on uplink
Contacts
Recovery Leader Connor Sanders (CS)
Other Attendees Isaac Lowe (IL), Marks Polakovs (MP)


Summary

At 05:00:00, our IT Services uplink switch, urysw4, rebooted for a regularly scheduled update (that nobody was aware of because we weren't on the ITS comms list).

AutoSwitcher had started dutifully doing the news at 04:59:45, and was preparing to finish doing the news at 05:02:00. It tried to switch back from WebStudio (the news is layered over a silent WebStudio source... don't ask) to Jukebox, but it found that, since our uplink was down, it couldn't reach the campus DNS servers, thus couldn't resolve selector.york.ac.uk, and thus couldn't switch back.

That left us with an empty WebStudio source broadcasting. Liquidsoap detected the silence (and sent a rather beautiful "Source 0 was on air" silence email), but couldn't switch back to Jukebox for the same reason. The switch finished rebooting at 05:04:30, but by then we were stuck on dead air. Dearie-Me, for some inexplicable reason, didn't fire until 05:10 (presumably static was keeping it from hitting the threshold). CS woke up and saw the alerts at 05:37, then switched back to Jukebox at 05:46 and investigated with the assistance of IL.

Recurrence mitigation

  • Reduce dependency on upstream services
      • Investigate a local caching DNS resolver?
          • MP - done-ish, running unbound on uryfw0, and many (but not all) boxes use it
  • Ask ITS nicely to tell us when they take down our campus uplink
      • MP - done
  • Ask ITS nicely to make it reboot at xx:30 instead of xx:00
      • MP - done
  • Improve documentation and logging of the new WebStudio services, to make future troubleshooting easier
  • Figure out why Dearie-Me didn't fire - possibly needs recalibrating
  • Reduce log spamminess of Dearie-Me; it filled up its journald buffer quite quickly
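
As an illustration of the "reduce dependency on upstream services" item, here is a minimal Python sketch of one possible approach: remember the last address a hostname resolved to, and fall back to it when DNS is unreachable. This is not how AutoSwitcher or Liquidsoap actually resolve selector. The function name, the in-memory cache, and the placeholder hostname and address in the demo are all assumptions made for the example.

```python
import socket

# Hypothetical in-memory cache: hostname -> last address that resolved successfully.
_last_known_good = {}


def resolve_with_fallback(hostname, resolver=socket.gethostbyname, cache=_last_known_good):
    """Resolve hostname; if resolution fails, return the last known good address."""
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror (DNS failure) is a subclass of OSError.
        if hostname in cache:
            return cache[hostname]
        raise  # Nothing cached, so the failure has to surface.
    cache[hostname] = address
    return address


if __name__ == "__main__":
    # Demo with fake resolvers so it runs without touching the network.
    # "selector.example" and 192.0.2.10 are placeholders, not real URY values.
    def dns_up(name):
        return "192.0.2.10"

    def dns_down(name):
        raise socket.gaierror("uplink is rebooting")

    demo_cache = {}
    print(resolve_with_fallback("selector.example", dns_up, demo_cache))
    print(resolve_with_fallback("selector.example", dns_down, demo_cache))
```

A cache like this only helps for names that have been resolved at least once since the process started, which is one reason a local caching resolver such as unbound is the more general fix.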

Timings

                  HH:MM:SS.mmm
 Dead air start:  05:02:06.500
 Dead air end:    05:45:42.000
 TOTAL:           00:43:35.500
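
The total above can be double-checked with a few lines of Python. This is just the subtraction of the two timestamps from the table; the variable names are arbitrary.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from the timings table above.
dead_air_start = datetime.strptime("05:02:06.500", FMT)
dead_air_end = datetime.strptime("05:45:42.000", FMT)

dead_air = dead_air_end - dead_air_start
print(dead_air)                  # 0:43:35.500000
print(dead_air.total_seconds())  # 2615.5
```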