Incident Report: 20200511
Latest revision as of 09:47, 7 July 2020
Incident Report | |
---|---|
Switch reboot took down DNS, which brooke'd selector | |
Summary | |
Severity | Moderate |
Impact | High (Dead air for around 45 minutes) |
Event Start | 2020-05-11 05:00 |
Event End | 2020-05-11 05:46 |
Recurrence Mitigation | Reduce dependency on uplink |
Contacts | |
Recovery Leader | Connor Sanders (CS) |
Other Attendees | Isaac Lowe (IL), Marks Polakovs (MP) |
Summary
At 05:00:00, our IT Services uplink switch, urysw4, rebooted for a regularly scheduled update (that nobody was aware of because we weren't on the ITS comms list).
AutoSwitcher had dutifully started doing the news at 04:59:45, and was preparing to finish at 05:02:00. It tried to switch back from WebStudio (the news is layered over a silent WebStudio source... don't ask) to Jukebox, but with our uplink down it couldn't reach the campus DNS servers, so it couldn't resolve selector.york.ac.uk, and therefore couldn't switch back.
That left us with an empty WebStudio source broadcasting. Liquidsoap detected the silence (and sent a rather beautiful "Source 0 was on air" silence email), but couldn't switch back to Jukebox for the same reason. The switch finished rebooting at 05:04:30, but we were stuck on dead air. Dearie-Me, for some inexplicable reason, didn't fire until 05:10 (presumably static was keeping it from hitting the threshold). CS woke up and saw the alerts at 05:37, switched back to Jukebox at 05:46, and then investigated with the assistance of IL.
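The failure above came down to a hostname lookup blocking a failover. As a rough illustration only (this is not AutoSwitcher's or Liquidsoap's actual code), the sketch below shows one way a client could remember the last address that resolved successfully and reuse it when DNS is unreachable. The function name `resolve_with_fallback` and the `_last_known_good` cache are hypothetical, and the sketch assumes the target's address rarely changes.

```python
import socket

# Hypothetical in-memory cache of the last address that resolved for each
# hostname. This is an illustration, not part of any existing URY service.
_last_known_good = {}


def resolve_with_fallback(hostname, resolver=socket.gethostbyname):
    """Resolve hostname, falling back to the last known good address.

    The resolver argument exists so the lookup can be swapped out in tests.
    """
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror (raised on lookup failure) is a subclass of OSError.
        if hostname in _last_known_good:
            return _last_known_good[hostname]
        # Nothing cached: there is no safe answer, so re-raise.
        raise
    _last_known_good[hostname] = address
    return address
```

A local caching resolver (as listed under the mitigations) addresses the same problem at the network level, so a per-client cache like this would only be a belt-and-braces measure.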
Recurrence mitigation
- Reduce dependency on upstream services
  - Investigate a local caching DNS resolver?
  - MP - done-ish, running unbound on uryfw0 and many (but not all) boxes use it
- Ask ITS nicely to tell us when they take down our campus uplink
  - MP - done
- Ask ITS nicely to make it reboot at xx:30 instead of xx:00
  - MP - done
- Improve documentation and logging of the new WebStudio services, to make future troubleshooting easier
- Figure out why Dearie-Me didn't fire - possibly needs a recalibrate
- Reduce log spamminess of Dearie-Me; it filled up its journald buffer quite quickly
Timings
HH:MM:SS
Dead air start: 05:02:06.500
Dead air end: 05:45:42.000
TOTAL: 00:43:35.500
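The total can be checked by subtracting the two timestamps. A minimal sketch, assuming both times fall on the same day (the variable names are illustrative only):

```python
from datetime import datetime

# Time-of-day format used in the timings above (fractional seconds included).
FMT = "%H:%M:%S.%f"

start = datetime.strptime("05:02:06.500", FMT)
end = datetime.strptime("05:45:42.000", FMT)

# Subtracting two datetimes gives a timedelta.
dead_air = end - start
print(dead_air)                  # 0:43:35.500000
print(dead_air.total_seconds())  # 2615.5
```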