Incident Report: 20200511

{{Incident
   |brief=Switch reboot took down DNS, which broke selector
   |severity=Moderate
   |impact=High (Dead air for around 45 minutes)
   |start=2020-05-11 05:00
   |end=2020-05-11 05:46
   |mitigation=Reduce dependency on uplink
   |leader=Connor Sanders (CS)
   |others=Isaac Lowe (IL), Marks Polakovs (MP)
}}
(Total dead air: 05:02:02-05:46)


== Summary ==
* Reduce dependency on upstream services
:* Investigate a local caching DNS resolver?
:* '''MP - done-ish, running unbound on uryfw0, and many (but not all) boxes use it'''
* Ask ITS nicely to tell us when they take down our campus uplink
:* '''MP - done'''
* Ask ITS nicely to make it reboot at xx:30 instead of xx:00
:* '''MP - done'''
* Improve documentation and logging of the new WebStudio services, to make future troubleshooting easier
* Figure out why Dearie-Me didn't fire - possibly needs recalibrating
* Reduce log spamminess of Dearie-Me; it filled up its journald buffer quite quickly
== Timings ==
                  HH:MM:SS
  Dead air start:  05:02:06.500
  Dead air end:    05:45:42.000
  TOTAL:          00:43:35.500
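
The total above can be re-derived from the two timestamps. The following is a minimal sketch for checking the arithmetic, not part of any station tooling; the function name <code>dead_air_duration</code> is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the Timings table: HH:MM:SS.fff
FMT = "%H:%M:%S.%f"


def dead_air_duration(start: str, end: str) -> timedelta:
    """Return end - start, assuming both times are on the same day."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


total = dead_air_duration("05:02:06.500", "05:45:42.000")
print(total)  # 0:43:35.500000
```

This gives 43 minutes 35.5 seconds, matching the TOTAL row.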


[[Category:Incident Reports]]