Incident Report: 20200627: Difference between revisions

Created page with "{{Incident |brief=uryups0 forgot what "uninterruptible power" means |severity=High |impact=Medium (Total loss of computing services for 2 hours - but we were off air)..."
 
 


At 13:58:29 the Stores UPS logged “UPS: The output power is turned off.”, seemingly for no reason. It tried to send us emails to warn us, but, considering it had just turned off power to the email server, that didn’t go well. At this point all of URY was down. '''OUTAGE BEGINS.'''
: Note: IL asked Danny to check his email (he uses a non-URY email for computing emails) for UPS alerts. There were none, which makes sense, as the UPS had just killed the gateway it was trying to send the emails through.


At 14:01 HS asked in Slack “Is it just me or is the website down?”, and at 14:03 MG confirmed with an @channel that we had dropped off completely. '''INCIDENT BEGINS.'''
Aftermath:
* pool0/backup needed to be manually mounted on bsod - intuitively enough, at /mnt/pool0, not /mnt/pool0/backup.
* IL gave uryfw0's RAID SDRAM a good kick, and it seems to work fine-ish. We still need to procure a spare, and uryfw0 isn't reliable enough to be our router, so urystv is still doing that.


== Root Cause Analysis ==


* Replace freshping with something better - MP done with uptimerobot
* Get a spare stick of RAID RAM
* Figure out what the hell went wrong with the UPS - it shouldn't really be in the business of killing power willy-nilly
[[Category:Incident Reports]]