Incident Report: 20200627

Revision as of 09:50, 22 July 2020

Incident Report
uryups0 forgot what "uninterruptible power" means
Summary
Severity High
Impact Medium (Total loss of computing services for 2 hours - but we were off air)
Event Start 2020-06-27 13:59
Event End 2020-06-27
Recurrence Mitigation Improve monitoring
Contacts
Recovery Leader Marks Polakovs (MP)
Other Attendees Michael Grace (MG), Matthew Stratford (MS), Isaac Lowe (IL), Harry Smith (HS), Alice Milburn (AVM), Jacob Dicker (JD)


Chronicle of Events

(All times BST)

At 13:58:29 the Stores UPS logged “UPS: The output power is turned off.” for no apparent reason. It tried to send us emails to warn us but, considering it had just turned off power to the email server, that didn’t go well. At this point all of URY was down. OUTAGE BEGINS.

At 14:01 HS asked in Slack “Is it just me or is the website down?”, and at 14:03 MG confirmed with an @channel that we had dropped off completely. INCIDENT BEGINS.

At 14:03:30 the UPS logged “UPS: The output power is turned on.” Great, except that all (read: most) servers were powered off, and weren’t set to boot on power (rightly so). Note that at this point we weren’t aware of a power problem; we presumed it was a janky uryfw0 ethernet cable again. AVM was closest to URY at the time, so she was sent in to investigate.

She arrived at around 14:30 and checked the physical links in The Hub; all looked normal. She tried reseating fw0’s ethernet cables (unaware that it was completely off). At 14:42 she tried power-cycling fw0, and at 14:45 she reported that it was displaying the fateful error message: “RAID Adapter Memory Error!!!” (the exclamation marks are really part of the error). Sadly, she then had to leave to do Boring Things in The Real World. HS was next closest, so he was dispatched.

While the motley crew on the Zoom call (read: MP, MG, and MS) waited for HS to arrive, they started discussing plans. They realised that urystv has near-identical hardware, so the plan became to swap uryfw0 and urystv’s hard disks - in effect making urystv the primary router.

HS arrives at 15:05. He reports that roughly half the servers were powered off (namely: uryfw0 (before AVM turned it on), urystv, urybackup0, ?). He gets to work on swapping the drives. This is finished at around 15:40 - HS tries importing the RAID array on urystv’s onboard adapter and booting it up. He has some issues, some related to the boot order, but gets it booted up at around 16:10. At this point we have network access, and the gang get to work powering the other servers back up. OUTAGE ENDS-ish.

At 16:25 MP checks the UPS logs, and spots the errors from earlier. He is very confused. So is JD. The job of a UPS is normally to provide uninterruptible power, and today it did exactly the opposite of that, for essentially no reason. At JD’s suggestion MP runs a self-test and it passes.

At 16:54, at MS’s suggestion, MP ran a full runtime calibration on the UPS - this also passed fine, although it still stubbornly reports a runtime of five minutes.
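As a cross-check on the timeline, the intervals implied by the timestamps above can be worked out directly. This is only a sketch: the 16:10 restoration time is the approximate one from the chronicle, and the variable names are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the chronicle above (all BST).
ups_output_off = datetime.strptime("2020-06-27 13:58:29", FMT)
ups_output_on = datetime.strptime("2020-06-27 14:03:30", FMT)
# The chronicle only says "around 16:10", so treat this one as approximate.
network_restored = datetime.strptime("2020-06-27 16:10:00", FMT)

ups_gap = ups_output_on - ups_output_off
outage = network_restored - ups_output_off

print(f"UPS output was off for {ups_gap}")          # 0:05:01
print(f"Network outage lasted roughly {outage}")    # 2:11:31
```

So the UPS itself was only off for about five minutes; the remaining two hours or so were spent getting someone on-site and the router back up.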

Server recovery:

  • ury - the website came back at 16:38. Somewhat hilariously, Freshping reported the website as down at 16:39, having somehow missed the previous three hours of downtime and decided that up is down and down is up. This made MP very angry and he killed Freshping.
  • urybsod - unsurprisingly it reported pending sectors and went into single-user mode. HS ran a fsck and it booted fine… or did it? More on that later.
  • urybackup0 - it had tres fun. As MS put it, “backup0 more like hiccup0.” I don’t actually remember much of this. Someone remind me to rewatch the Zoom recording.
  • One of the loggers’ loggerng service didn’t start up properly for some reason. MP started it manually.

Aftermath:

  • pool0/backup needed to be manually mounted on bsod - intuitively enough, at /mnt/pool0, not /mnt/pool0/backup.
  • IL gave uryfw0's RAID SDRAM a good kick, and it seems to work fine-ish. We still need to procure a spare, and it's not reliable enough to be our router so urystv is still doing that.

Root Cause Analysis

Why did we drop off the internet? Because uryfw0 lost power.

Why did uryfw0 lose power? Because uryups0 turned it off.

Why did uryups0 turn off power? Nobody knows.

Why did nobody notice? Well, they did - and also, Freshping is awful.

Why did it take us so long to come back? Because nobody was on-site at the time, and because the servers didn’t boot back up immediately.

Why was nobody on-site? Because rona.

Why did the servers not boot back up? Because they weren’t set to boot on power.

Why weren’t they set to boot on power? Because the rush of power may trip the breaker. Sensible.
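To illustrate the trade-off in the last two answers: the concern is everything drawing power at the same instant, so one possible middle ground is to bring machines up one at a time with a gap between them. The sketch below only computes such a schedule. The 30-second spacing and the host order are assumptions for illustration, not anything configured at URY, and how a delay would actually be applied depends on each server's firmware.

```python
def staggered_schedule(hosts, spacing_seconds=30):
    """Map each host to a power-on delay so that no two hosts start together.

    `spacing_seconds` is a hypothetical value chosen for illustration.
    """
    return {host: index * spacing_seconds for index, host in enumerate(hosts)}


# Host names from the chronicle; the ordering (router first) is an assumption.
schedule = staggered_schedule(["uryfw0", "urystv", "urybackup0"])

for host, delay in schedule.items():
    print(f"{host}: power on after {delay}s")
```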

Post-Recovery Actions

  • Replace Freshping with something better - done by MP, using UptimeRobot
  • Get a spare stick of RAID RAM
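For context on the monitoring action: the UPS's own warning emails never left the building because the mail server lost power too, so whatever checks the site needs to run from somewhere that does not share URY's power. A minimal sketch of the kind of HTTP check an external monitor performs is below. The function name `is_up` and its `opener` parameter are invented here for illustration; this is not how UptimeRobot or Freshping work internally.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds.

    `opener` exists only so the check can be exercised without a network;
    by default it is the standard library's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False
```

Run from a host outside the station's network on a timer, something like `is_up("https://example.org")` (substituting the real site address) would have flagged the outage within one polling interval rather than a minute after recovery.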