{{Incident
|brief=Switch reboot took down DNS
|severity=Moderate
|impact=High (Dead air for around 2 minutes)
|start=25/02/2017 16:29
|end=25/02/2017 17:30
|mitigation=Reduce dependency on uplink
|leader=Connor Sanders (CS)
|others=Isaac Lowe (IL)
}}

(Total dead air: 05:02:02-05:46)

== Summary ==

At 05:00:00, our IT Services uplink switch, urysw4, rebooted for a regularly scheduled update (that nobody was aware of because we weren't on the ITS comms list).

AutoSwitcher had started dutifully doing the news at 04:59:45, and was preparing to finish doing the news at 05:02:00. It tried to switch back from WebStudio (the news is layered over a silent WebStudio source... don't ask) to Jukebox, but it found that, since our uplink was down, it couldn't reach the campus DNS servers, thus couldn't resolve ''selector.york.ac.uk'', and thus couldn't switch back.

That left us with an empty WebStudio source broadcasting. Liquidsoap detected the silence (and sent a rather beautiful "Source 0 was on air" silence email), but couldn't switch back to Jukebox for the same reason. The switch finished rebooting at 05:04:30, but by then we were stuck on dead air. Dearie-Me, for some inexplicable reason, didn't fire until 05:10 (presumably static was keeping the signal from hitting the silence threshold). CS woke up and saw the alerts at 05:37, switched back to Jukebox at 05:46, and then investigated with the assistance of IL.
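The whole failure chain above hinged on one hostname lookup failing while the uplink was down. As a rough illustration only (this is not how AutoSwitcher or Liquidsoap actually resolve names, and the function name, cache, and example addresses are all hypothetical), a client could remember the last address that resolved successfully and fall back to it when DNS is unreachable:

```python
import socket

# Hypothetical in-memory cache: hostname -> last IP address that resolved.
_last_known_good = {}


def resolve_with_fallback(hostname, resolver=socket.gethostbyname):
    """Resolve hostname, falling back to the last address that worked.

    `resolver` is injectable so the behaviour can be exercised without
    a real DNS server; by default it is socket.gethostbyname.
    """
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror (raised on lookup failure) is a subclass of OSError.
        if hostname in _last_known_good:
            return _last_known_good[hostname]
        # Nothing cached for this name, so the failure is passed on.
        raise
    _last_known_good[hostname] = address
    return address
```

This only helps if the name has resolved at least once since the process started, and it will happily return a stale address, so it is a stopgap rather than a replacement for a local caching resolver.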

== Reoccurrence mitigation ==

* Reduce dependency on upstream services
:* Investigate a local caching DNS resolver?
* Ask ITS kindly to tell us when they take down our campus uplink
* Ask ITS kindly to make it reboot at xx:30 instead of xx:00
* Improve documentation and logging of the new WebStudio services, to make future troubleshooting easier
* Reduce log spamminess of Dearie-Me, which filled up its journald buffer quite quickly

[[Category:Incident Reports]]
