Difference between revisions of "Incident Report: 20230109"

From URY Wiki
 
(3 intermediate revisions by 2 users not shown)
Line 2:
  |brief=ITS replaced some networking gear and broke some routes. Sad panda.
  |severity=High
- |impact=Medium (anything URY-related unavailable for around ~8 hours)
+ |impact=High (anything URY-related unavailable for around ~8 hours)
  |start=2023-01-09 09:30
  |end=2023-01-09 17:19
Line 15:
  Early in the morning of the first day of the first week of term, IT Services (as we later found out) replaced some networking equipment in the Vanbrugh area, and may have incorrectly set some static routes. The first we found out about this is at 9:30, when all of our monitoring pinged that URY had dropped off the face of the internet.

- Much scrambling and some reboots later, we narrowed down the state of the network to:
+ Much scrambling and some (ultimately futile) reboots later, we narrowed down the state of the network to:
  * Anything with a 144.32.64.160/27 IP (so all of URY) could send packets ''out'' of URY, but no packets would make it ''in''.
  :* With the exception of uryfw0 which has a separate IP (144.32.109.64).
Line 24:
  ::* You really don't want to know how... no, seriously, it's horrible...

- Cue an ITS ticket, and at around 17:14 service was restored.
+ An ITS ticket was filed at 12:12, and at around 17:14 service was restored.

  == Lessons Learned ==

- TODO
+ * We need to be more happy to assume, if we haven't touched the network and something's happened, it could be an ITS issue. Look at things like traceroutes earlier.
+ * ITS don't really know about us - like they'd have just assumed uryfw0 is just 144.32.109.64 and not a gateway for all of 144.32.64.160/27, so be happy to remind them of this.
+ * Phone ITS earlier if it's a big problem.
+ * Don't get distracted by things that are merely the result of the problem - i.e. our DNS is often external to URY (i.e. wogan or ITS nameservers), so it can't resolve DNS. But it doesn't mean the key problem is DNS - the key problem is that no traffic reached us. What didn't help was the confusion about guest PC being able to access the internet - this is because it has a web proxy (because IRN).
+

  [[Category:Incident Reports]]

Latest revision as of 20:05, 11 January 2023

Incident Report
ITS replaced some networking gear and broke some routes. Sad panda.
Summary
Severity High
Impact High (anything URY-related unavailable for around 8 hours)
Event Start 2023-01-09 09:30
Event End 2023-01-09 17:19
Recurrence Mitigation All necessary changes implemented
Contacts
Recovery Leader Joseph Sisson (JS)
Other Attendees Michael Grace (MG), Marks Polakovs (MP)


Chronicle of Events

(All times GMT)

Early in the morning of the first day of the first week of term, IT Services (as we later found out) replaced some networking equipment in the Vanbrugh area, and may have incorrectly set some static routes. The first we knew of this was at 09:30, when all of our monitoring pinged that URY had dropped off the face of the internet.

Much scrambling and some (ultimately futile) reboots later, we narrowed down the state of the network to:

  • Anything with a 144.32.64.160/27 IP (so all of URY) could send packets out of URY, but no packets would make it in.
      • With the exception of uryfw0, which has a separate IP (144.32.109.64).
  • Traceroutes showed the packets getting into a loop somewhere in Berrick Saul.
  • This meant that the website and online streams were down.
  • We could carry on broadcasting on AM, but FM failed.
      • MP later diagnosed the FM issue to be a dependency on audio.ury.org.uk for the backup feed, and restored FM at 14:39.
          • You really don't want to know how... no, seriously, it's horrible...
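
As a quick illustration of why uryfw0 stayed reachable while everything else did not, the sketch below checks addresses against the URY range using Python's standard `ipaddress` module. The two addresses are the ones quoted above; the helper name `in_ury_subnet` and the sample host 144.32.64.170 are made up for this example and do not refer to any real URY tooling or machine.

```python
import ipaddress

# The URY range and the firewall's separate address, as quoted in this report.
URY_SUBNET = ipaddress.ip_network("144.32.64.160/27")
URYFW0_ADDRESS = ipaddress.ip_address("144.32.109.64")


def in_ury_subnet(address: str) -> bool:
    """Return True if the address falls inside the URY /27 (hypothetical helper)."""
    return ipaddress.ip_address(address) in URY_SUBNET


# A /27 leaves 5 host bits, so the range covers 32 addresses.
print(URY_SUBNET.num_addresses)        # 32
print(URY_SUBNET[0], URY_SUBNET[-1])   # 144.32.64.160 144.32.64.191

# 144.32.64.170 is an arbitrary address picked from inside the range.
print(in_ury_subnet("144.32.64.170"))  # True

# uryfw0 sits outside the /27, so a broken route for the /27 would not affect it.
print(URYFW0_ADDRESS in URY_SUBNET)    # False
```

Because uryfw0's own address is outside the /27, a route that is wrong only for 144.32.64.160/27 would leave the firewall itself reachable, which matches what was observed.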

An ITS ticket was filed at 12:12, and at around 17:14 service was restored.

Lessons Learned

  • If we haven't touched the network and something breaks, we should be quicker to suspect an ITS issue. Run diagnostics such as traceroutes earlier.
  • ITS don't really know how our network is set up. For example, they would probably have assumed that uryfw0 is just 144.32.109.64, not a gateway for all of 144.32.64.160/27, so be happy to remind them of this.
  • Phone ITS earlier if it's a big problem.
  • Don't get distracted by things that are merely symptoms of the problem. For example, our DNS is often served from outside URY (i.e. wogan or ITS nameservers), so name resolution failed. That doesn't mean DNS was the key problem - the key problem was that no traffic reached us. The confusion over the guest PC still being able to access the internet didn't help - it could do so because it uses a web proxy (because IRN).
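
To make the "look at traceroutes earlier" lesson a little more concrete, here is a minimal sketch of spotting a routing loop in a list of traceroute hops. It assumes the hop addresses have already been extracted from traceroute output by some other means. The function name `find_routing_loop` is hypothetical, and the sample hops use the 192.0.2.0/24 documentation range rather than anything recorded during the incident.

```python
from typing import Optional, Sequence


def find_routing_loop(hops: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first hop address that appears a second time, or None.

    Each entry is the responding address for one TTL; None stands for a
    hop that timed out. A repeated address is a strong hint (not proof)
    that packets are bouncing between routers.
    """
    seen = set()
    for hop in hops:
        if hop is None:
            continue  # timed-out hops tell us nothing about loops
        if hop in seen:
            return hop
        seen.add(hop)
    return None


# Illustrative hop list only: documentation addresses, not real ITS routers.
looping_trace = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.2", "192.0.2.3"]
healthy_trace = ["192.0.2.1", None, "192.0.2.2", "192.0.2.3"]

print(find_routing_loop(looping_trace))  # 192.0.2.2
print(find_routing_loop(healthy_trace))  # None
```

A check like this only flags the symptom; working out which static route is wrong still needs someone with access to the routers, which in our case means ITS.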