Incident Report: 20200627 and Shutting Down URY In A Hurry: Difference between pages

From URY Wiki
{{Incident
  |brief=uryups0 forgot what "uninterruptible power" means
  |severity=High
  |impact=Medium (Total loss of computing services for 2 hours - but we were off air)
  |start=2020-06-27 13:59
  |end=2020-06-27
  |mitigation=Improve monitoring
  |leader=Marks Polakovs (MP)
  |others=Michael Grace (MG), Matthew Stratford (MS), Isaac Lowe (IL), Harry Smith (HS), Alice Milburn (AVM), Jacob Dicker (JD)
}}


== Chronicle of Events ==


''(All times BST)''


At 13:58:29 the Stores UPS logged “UPS: The output power is turned off.”, seemingly for no reason. It tried to send us emails to warn us but, considering it had just turned off power to the email server, that didn’t go well. At this point all of URY was down. '''OUTAGE BEGINS.'''


At 14:01 HS asked in Slack “Is it just me or is the website down?”, and at 14:03 MG confirmed with an @channel that we had dropped off completely. '''INCIDENT BEGINS.'''


At 14:03:30 the UPS logged “UPS: The output power is turned on.” Great, except that all (read: most) servers were powered off and weren’t set to boot on power (rightly so). Note that at this point we weren’t aware of a power problem; we presumed it was a janky uryfw0 ethernet cable again. AVM was closest to URY at the time, so she was sent in to investigate.


She arrived at around 14:30 and checked the physical links in The Hub; all looked normal. She tried reseating fw0’s ethernet cables (unaware that it was completely off). At 14:42 she tried power-cycling fw0, and at 14:45 she reported that it was displaying the fateful error message: “RAID Adapter Memory Error!!!” (the exclamation marks are really part of the error). Sadly, she then had to leave to do Boring Things in The Real World. HS was next closest, so he was dispatched.


While the motley crew on the Zoom call (read: MP, MG, and MS) waited for HS to arrive, they started discussing plans. They realised that urystv has near-identical hardware, so the plan became to swap uryfw0 and urystv’s hard disks, in effect making urystv the primary router.


HS arrives at 15:05. He reports that roughly half the servers were powered off (namely: uryfw0 (before AVM turned it on), urystv, urybackup0, ?). He gets to work on swapping the drives. This is finished at around 15:40, and HS tries importing the RAID array on urystv’s onboard adapter and booting it up. He has some issues, some related to the boot order, but gets it booted up at 16:10ish. At this point we have network access, and the gang get to work powering the other servers back up. '''OUTAGE ENDS-ish.'''
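The timestamps in the chronicle can be turned into rough durations. The following is only a sketch: it takes the times as written above, treats "16:10ish" as exactly 16:10 (an assumption), and uses variable names invented for illustration.

```python
from datetime import datetime

# Timestamps taken from the chronicle (all BST, 2020-06-27).
# "16:10ish" is approximate in the report, so the recovery figure is too.
FMT = "%Y-%m-%d %H:%M:%S"
power_off = datetime.strptime("2020-06-27 13:58:29", FMT)     # UPS output off
confirmed = datetime.strptime("2020-06-27 14:03:00", FMT)     # MG's @channel
power_on = datetime.strptime("2020-06-27 14:03:30", FMT)      # UPS output on
network_back = datetime.strptime("2020-06-27 16:10:00", FMT)  # urystv routing


def minutes(delta):
    """Return a timedelta as whole minutes, rounded down."""
    return int(delta.total_seconds() // 60)


time_to_detect = confirmed - power_off      # outage start to confirmation
ups_off_time = power_on - power_off         # how long the UPS output was off
time_to_recover = network_back - power_off  # outage start to network access

print(minutes(time_to_detect), minutes(ups_off_time), minutes(time_to_recover))
```

On these figures the UPS output was only off for about five minutes; nearly all of the roughly 131 minutes without network went on getting someone on-site and getting a router booted.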


At 16:25 MP checks the UPS logs, and spots the errors from earlier. He is very confused. So is JD. The job of a UPS is normally to provide uninterruptible power, and today it did exactly the opposite of that, for essentially no reason. At JD’s suggestion MP runs a self-test and it passes.
[[Category:Technical How-Tos]]
 
At 16:54 at MS’ suggestion MP ran a full runtime calibration on the UPS - this also passed fine, although it still stubbornly reports a runtime of five minutes.
 
Server recovery:
* ury - the website came back at 16:38. Somewhat hilariously, Freshping reported that the website was down at 16:39, having somehow missed the previous three hours of downtime and deciding that up is down and down is up. This made MP very angry and he killed Freshping.
* urybsod - unsurprisingly it reported pending sectors and went into single-user mode. HS ran a fsck and it booted fine… or did it? More on that later.
* urybackup0 - it had tres fun. As MS put it, “backup0 more like hiccup0.” I don’t actually remember much of this. Someone remind me to rewatch the Zoom recording.
* One of the loggers’ loggerng service didn’t start up properly for some reason. MP started it manually.
 
Aftermath:
* pool0/backup needed to be manually mounted on bsod - intuitively enough, at /mnt/pool0, not /mnt/pool0/backup.
 
== Root Cause Analysis ==
 
Why did we drop off the internet? Because uryfw0 lost power.
 
Why did uryfw0 lose power? Because uryups0 turned it off.
 
Why did uryups0 turn off power? Nobody knows.
 
Why did nobody notice? Well, they did - and also, Freshping is awful.
 
Why did it take us so long to come back? Because nobody was on-site at the time, and because the servers didn’t boot back up immediately.
 
Why was nobody onsite? Because rona.
 
Why did the servers not boot back up? Because they weren’t set to boot on power.
 
Why weren’t they set to boot on power? Because the rush of power may trip the breaker. Sensible.
 
== Post-Recovery Actions ==
 
* Replace freshping with something better - MP done with uptimerobot
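One lesson from the analysis above is that the alert path must not depend on the thing that failed: the UPS's warning emails needed the email server it had just powered off. As a rough illustration only (this is not how uptimerobot works, and the function names, threshold and example URL are assumptions), an external probe run from a machine outside URY might look like this:

```python
import urllib.error
import urllib.request


def default_fetch(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, etc.
        return None


def is_up(url, fetch=default_fetch):
    """Treat any 2xx/3xx response as up; errors and no response as down."""
    status = fetch(url)
    return status is not None and 200 <= status < 400


def should_alert(recent_results, threshold=3):
    """True when the last `threshold` probes all failed (hypothetical policy)."""
    if len(recent_results) < threshold:
        return False
    return not any(recent_results[-threshold:])


# Example use (assumed URL; run from outside URY and notify via Slack or phone,
# not via @ury.org.uk email):
#   history.append(is_up("https://ury.org.uk/"))
#   if should_alert(history): ...
```

Requiring several consecutive failures before alerting is a design choice to avoid paging people for a single dropped request; the right threshold depends on how often the probe runs.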

Latest revision as of 12:23, 23 July 2020

This is an ops-critical document. A printed copy is available in the Server Cupboard and should be updated whenever this online version is.

Turn servers off in this order, waiting a few seconds between each button:

  • urystv [no need to wait]
  • ury (thunderhorn)
  • dolby
  • urybsod
  • urysteve
  • urybackup0
  • uryfw0
  • transmitter, uryblue, uryred [call engineering now]
  • uryrrod [VMware]

Note: The KVM is powered by a 12V brick, so can’t go on UPS power. So you need to move the monitor cable between each server if one seems to be having trouble going down. The keyboard should still pass through using power gleaned from the PS/2 ports, if you want to risk that. [not sure if this is still a thing?]

A big factor in delays on powering down is hanging waiting on NFS/SMB - problematic if you’ve shut down whatever was providing the mount, so stick to this order.
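The order above can also be kept as data, for example to print a checklist during a drill. This is a minimal sketch, not an existing URY tool: the servers are still powered off by hand, the five-second pause is an assumed value for "a few seconds", and `run_checklist` is a hypothetical name.

```python
import time

# (label, note, wait_after) copied from the list above.
SHUTDOWN_ORDER = [
    ("urystv", "no need to wait", False),
    ("ury (thunderhorn)", "", True),
    ("dolby", "", True),
    ("urybsod", "", True),
    ("urysteve", "", True),
    ("urybackup0", "", True),
    ("uryfw0", "", True),
    ("transmitter, uryblue, uryred", "call engineering now", True),
    ("uryrrod", "VMware", False),  # last step, nothing left to wait for
]


def run_checklist(order=SHUTDOWN_ORDER, wait_seconds=5,
                  sleep=time.sleep, out=print):
    """Print each power-off step in order, pausing where the list says to wait."""
    for step, (label, note, wait_after) in enumerate(order, start=1):
        suffix = f" [{note}]" if note else ""
        out(f"{step}. Power off {label}{suffix}")
        if wait_after:
            sleep(wait_seconds)
```

Passing `sleep` and `out` as parameters keeps the function easy to dry-run without actually waiting.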

As early as possible during this process, try to reach one of: Station Manager; Assistant Station Manager; Programme Controller to inform them of the service outage so they can invoke necessary social media routes.

Remember: Once these servers are off, sending emails to @ury.org.uk email accounts doesn't work! Use Slack, @york.ac.uk addresses, Facebook or phone numbers.

You now won't have much to do until the power's back on, most likely. Using a manual writing implement, make note of how the procedure went in preparation for Cold-Starting URY Systems later on.

Rationale

  • urystv has no critical mounts, so it can be a quick way to shed some load
  • ury goes after that since it doesn’t have any mounts elsewhere
  • dolby after that, because of postgres
  • urybsod has some exports pertaining to log generation, namely to uryrrod and ury
  • urysteve now, because of /music
  • urybackup0 now because urysteve backs up to it. [If the UPS is absolutely screaming about low battery, you can risk taking this down first as it does draw the most power - still accurate?]
  • uryfw0 after all that -- would be handy to still have comms if servers need to cross networks (unmounting loggers)
  • The loggers, in no particular order.
  • Logging is a legal requirement, so the transmitter must be turned off if the loggers are powered down, and especially if the UPS power fails altogether. Call engineering to let them know this has happened.
  • uryrrod mounts urybsod for mixclouder and urybackup0 for webcams, so it may be unhappy when those mounts disappear - this is not critical though
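The rationale boils down to "shut a client down before the server it mounts". A small check can confirm that the documented order follows this rule. The sketch below is only one reading of the bullets above: the dependency pairs are limited to those the rationale states explicitly (treating the urysteve backup as a dependency is an interpretation), and the names are illustrative.

```python
# Shutdown order from this page; "loggers" stands for uryblue and uryred.
SHUTDOWN_ORDER = [
    "urystv", "ury", "dolby", "urybsod", "urysteve",
    "urybackup0", "uryfw0", "loggers", "uryrrod",
]

# (client, server): the client uses something the server provides,
# so ideally the client goes down first.
DEPENDS_ON = [
    ("ury", "urybsod"),         # urybsod exports to ury
    ("uryrrod", "urybsod"),     # mixclouder mount
    ("urysteve", "urybackup0"), # urysteve backs up to urybackup0
    ("uryrrod", "urybackup0"),  # webcams mount
]


def order_violations(order, depends_on):
    """Return the (client, server) pairs where the server goes down first."""
    position = {host: i for i, host in enumerate(order)}
    return [(client, server) for client, server in depends_on
            if position[client] > position[server]]


print(order_violations(SHUTDOWN_ORDER, DEPENDS_ON))
# [('uryrrod', 'urybsod'), ('uryrrod', 'urybackup0')]
```

The only pairs flagged are the two uryrrod mounts, which matches the last bullet: uryrrod goes down after the servers it mounts, and the page accepts that as non-critical.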