Hey everyone,
tl;dr: There was a power cut. We turned everything off and back on again. It’s fine now.
As promised, this is a full breakdown of how the power outage on Sunday affected our systems and how we responded. You know, so we can learn from it and do better next time.
This outage came as a surprise, but a well overdue one. Usually, we expect to experience an outage at a rate of about once a term, and yet we went the entirety of the last academic year without any environment issues. Naturally, this one started shortly after a conversation about how we hadn’t had a power cut for some time.
Event Start: 01:42
At first, we didn’t think there was a substantial issue - the power in James College was knocked out for about a second, but their UPS coped so we assumed ours would too. Some fun was poked at YSTV for neglecting to put their web/db server on said UPS. Once we confirmed our own website was inaccessible, we made our way over to investigate, but assumed v1rs0 was just rebooting.
We reached URY to find it running on UPS power, so we started a graceful shutdown of all servers using ACPI (i.e. the power button). We left logger2 running at this stage because of its uptime, and because logger2 is a hardy-ish device that’s seen a lot anyway. Everything else was powered down by 01:52.
UPS Failure: 02:13
By this point, logger2 had exhausted the UPS and powered off. We also powered off the transmitter, as we would not be logging when power was restored. Security confirmed that all emergency staff had been called in, but the interesting arrangement of areas without power meant it would likely be a while.
The patch bay UPS running jukebox and the switches lasted a bit longer, but turned off a little while afterwards. Of note is that for the majority of the time, both UPS units were on ‘1 flashing bar’ of power, suggesting they needed calibrating again.
We spent the time wandering around campus in the dark and drinking hot chocolate in YSTV, living off the few drips of Internet that Lloyd’s 3G tethering could provide, and enjoying the rare view of the stars afforded by the reduced light pollution.
Power Restored: 05:42
We noticed a flicker from YSTV’s lights, had a wander over, and saw that power was back. We also learned that the method of ‘looking out of YSTV’s window to see if the physics bridge is lit’ doesn’t work well. We rejoiced.
Problem #1: Stores Distribution Board
The B16 breaker that Estates told me they would upgrade to a C16 last year tripped from the sudden desire of a large quantity of batteries to charge up. Turning the power back on caused a happy bright flash, some dimmed lights, then the breaker re-tripping.
Luckily, switching off the compressors and the distribution amplifier for the audio path was sufficient: the UPS began to charge at 05:44, and we enabled server power at 05:49.
Problem #2: logger2
As is traditional, logger2 decided not to play ball when we powered it up. It made strange noises and smelled a bit, so we left it off overnight. Confirmation that logger1 and uryblue were operating left us happy to power the transmitter on by 06:30. logger2 did start up later in the day, so it was probably just being fussy because it was a bit cold.
Problem #3: IT Services Systems
Other servers experienced delays coming up due to their reliance on the campus DNS and NTP servers, but once running they worked as expected. Core IT Services systems came up automatically shortly after power was restored.
Problem #4: Music Store Mounting
Partly Lloyd’s fault from playing with drbd way back when, /etc/fstab refused to agree with the fact that /dev/sdb1 is an ext4 filesystem, and so the music store would not mount. Mounting it manually as ext4 worked, but this is a manual step that needs removing.
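For the record, the fix we want is roughly the following - a minimal sketch only, assuming the music store is the /music mount mentioned further down and that /dev/sdb1 really is where it lives now that drbd is out of the picture (the mount options are just sensible defaults, not our exact entry):
 # /etc/fstab on uryfs1 - spell the type out so mount stops second-guessing it
 /dev/sdb1   /music   ext4   defaults   0   2
 # Sanity check without a reboot; /music should appear in the mount list
 mount -a && mount | grep /music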
At this stage, we verified that essential services - the digital audio path, member web services, loggers and BAPS - were all working sufficiently, and declared our state of emergency at an end.
The only remaining immediate issue was our Outside Broadcast, scheduled to start at 10am. A contingency plan was devised and communicated in case IT Services had not restored the NAS by that time. Unfortunately, the on-site engineers for the OB assumed that the wired network was still unavailable because the OB machines did not detect a network connection, when, in fact, they had used the wrong network socket.
Event End: 07:11
Other Points of Note
After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service:
When the power failed in the IT Services Berrick Saul Datacenter, our virtual machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, because it had no network access and an NFS mount in its fstab, the server was found sitting in single-user mode waiting for administrator intervention when it was next checked on Sunday evening.
The Stores UPS (the main backup supply for our servers) has now successfully acknowledged that it has 6 minutes of runtime, not 3-4 (of course, we know from the experience of this event that it is easily 15).
When the scheduled backup processes ran in the early hours of Monday morning, the backup system failed to detect that the network mountpoints were not set up on uryfs1. The system started to back up to its own root filesystem until all space was consumed, at which point services started to fail overnight. Luckily, Icecast was not an affected service, and successfully fell back to a static jukebox feed when it detected that Liquidsoap had stopped responding. This issue was not noticed until mid-afternoon on Monday when a technical member entered the station to see the Studio Clock reporting that the server was out of space (it turns out the studio clock does that now).
The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until we tried to resolve the uryfs1 backup issue, so our primary archive filestore was unavailable to members for most of the day.
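To illustrate the kind of guard we’re after (and nothing more than that - the script name and the /mnt/backup path are made up for the example; pool0 and pool1 are the real urybackup0 pools), a pre-backup check could look roughly like this:
 #!/bin/bash
 # pre-backup-check.sh - rough sketch of a guard to run before any backup job.
 # /mnt/backup stands in for the real uryfs1 network mountpoint; pool0 and
 # pool1 are the urybackup0 ZFS pools mentioned above.
 set -e
 # Refuse to back up unless the target is a real mounted filesystem, otherwise
 # we fill the root filesystem again.
 if ! mountpoint -q /mnt/backup; then
     echo "Backup target /mnt/backup is not mounted; attempting a remount" >&2
     mount /mnt/backup || { echo "Remount failed, aborting backup" >&2; exit 1; }
 fi
 # On urybackup0: make sure the ZFS pools are mounted before anything needs them.
 for pool in pool0 pool1; do
     mountpoint -q "/$pool" || zfs mount "$pool"
 done
mountpoint -q is the key bit - it distinguishes ‘the directory exists’ from ‘a filesystem is actually mounted there’, which is exactly the distinction the backup run missed.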
Changes Made
The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully, as they have AT power supplies. The next cable tidy-up in that rack should remove the excess mains cable.
uryfs1 will now detect missing backup mounts and attempt to restore them.
Remove the need to enter console passwords on boot for ury and uryfs1
Decrypt the SSL certificates and review permissions on their storage directory
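For reference, that certificate change amounts to something along these lines - a sketch assuming ordinary passphrase-protected PEM RSA keys, with illustrative filenames rather than our actual paths:
 # Strip the passphrase so services can come up unattended after a power cut
 openssl rsa -in ury.key.enc -out ury.key
 # Then lock down who can read the now-unencrypted key (ssl-cert is the Debian
 # convention for a group of services allowed to read keys; adjust to taste)
 chown root:ssl-cert ury.key
 chmod 640 ury.key
 chmod 710 /etc/ssl/private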
Changes to be Made
Investigate Stores Distribution Board upgrade
Re-raise the issue with Estates and YUSU
Prevent uryfs1 and other servers from backing up if the mount fails to be brought up
Anyone want to practice their bash-fu?
Ensure urybackup0 mounts /pool0 and /pool1 on boot
Currently need to run zfs mount pool0 && zfs mount pool1
Ensure uryfs1 mounts /music on boot
Server seems to get its filesystem types confused
Remember to power uryfw0 up first
Much spinning waiting on NTP etc otherwise (partly not our fault)
Add checking backup mounts and uryrrod to our standard boot procedure, plus generally update the document
https://docs.google.com/document/d/12gdrkNWPqC0hc0sJ1ETqM9TXCwmQy3ZBma1anc8ZO7M/edit
Review policy for communication on social media during a failure
Should named Technical persons have access for these scenarios?
Review policy for calling named persons at 2am (this kind of thing normally happens during the day)
Would Al have rather been woken up than not know?
Improve documentation on Outside Broadcast systems and standard contingency plans if 802.1x is unavailable on the campus network
Known non-NAS/opfa ports, IP addresses available to us
Give engineering the learnings of ‘how to tell what a port is’?
We should generally have a how-to guide for the freshers
Review whether the stores distribution board should be touched (I mean, look at it)
Talk to Estates/Health and Safety? Last time I did I got a “What the f***” from the electrician
Investigate automatic shutdown options?
We now know the switch/jukebox UPS lasts a good length of time, so the very short comms runtime (~90s) that was the limiting factor last time this was considered is no longer a problem.
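By way of illustration only - assuming we went with something like Network UPS Tools (NUT), with a made-up UPS name and a guessed threshold - an automatic shutdown check run from cron could be as small as:
 #!/bin/bash
 # ups-watchdog.sh - very rough sketch; "stores-ups" is a placeholder name and
 # the 30% threshold is a guess, not a measured figure.
 UPS="stores-ups@localhost"
 STATUS=$(upsc "$UPS" ups.status 2>/dev/null)
 CHARGE=$(upsc "$UPS" battery.charge 2>/dev/null)
 CHARGE=${CHARGE:-100}   # if we cannot read the charge, err on the side of staying up
 # "OB" in ups.status means we are on battery; shut down before it runs flat.
 if [[ "$STATUS" == *OB* && "$CHARGE" -lt 30 ]]; then
     logger "UPS on battery at ${CHARGE}%, shutting down"
     shutdown -h now "Power cut: UPS nearly exhausted"
 fi
In practice NUT’s own upsmon does this job properly (including coordinating several servers on one UPS), so the sketch is only to show that the moving parts are small.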
Hopefully this’ll be of some use in the future.
Lloyd & Anthony
Were-there-when-it-happened Officers
[[Category:Incident Reports]]