Incident Report: 20131201

The email below is the post-mortem of a Campus North power outage at 01:42am on Sunday, 1st December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.

Incident Report
A displeased 12kV power cable shorted, taking out parts of the local power grid. URY's servers were unhappy with this series of events, so massaging was needed.
Summary
Severity: Critical
Impact: Medium (main outage at night, residual issues were for minor services)
Event Start: 01/12/2013 01:42
Event End: 01/12/2013 07:11
Recurrence Mitigation: Steps taken, but further action needed.
Contacts
Recovery Leader: Anthony Williams <anthony@ury.org.uk>
Other Attendees: Lloyd Wallis <lpw@ury.org.uk>


Hey everyone,

tl;dr: There was a power cut. We turned everything off and back on again. It’s fine now.

As promised, this is a full breakdown of how the power outage on Sunday affected our systems and how we responded. You know, so we can learn from it and do better next time.

This outage came as a surprise, but a well overdue one. Usually, we expect a power cut at a rate of about once a term, yet we went the entirety of the last academic year without any environment issues. Naturally, this one started shortly after a conversation about how we hadn’t had a power cut for some time.

Event Start: 01:42

At first, we didn’t think there was a substantial issue - the power in James College was knocked out for about a second, but their UPS coped, so we assumed ours would too. Some fun was poked at YSTV for neglecting to put their web/db server on said UPS. Once we confirmed our website was inaccessible, we made our way over to investigate, assuming v1rs0 was just rebooting.

We reached URY to find it running on UPS power, so we started a graceful shutdown of all servers using ACPI (i.e. the power button). We left logger2 running at this stage because of its uptime; logger2 is a hardy-ish device that’s seen a lot anyway. Everything else was powered down by 01:52.

UPS Failure: 02:13

At this point, logger2 had exhausted the UPS and was powered off. The transmitter was also powered off, as we would not be logging when power was restored. Security confirmed that all emergency staff had been called in, but the interesting arrangement of areas without power meant it would likely be a while.

The patch bay UPS running jukebox and the switches lasted a bit longer, but turned off a little while later. Of note is that, for the majority of the time, both UPS units were on ‘1 flashing bar’ of power, suggesting they needed calibrating again.

We spent the time wandering around campus in the dark and drinking hot chocolate in YSTV, living off the few drips of Internet Lloyd’s 3G tethering could provide and enjoying the rare view of the stars afforded by the reduced light pollution.

Power Restored: 05:42

We noticed a flicker from YSTV’s lights, wandered over, and saw that power was back. We also learned that the method of ‘looking out of YSTV’s window to see if the physics bridge is lit’ doesn’t work well. We rejoiced.

Problem #1: Stores Distribution Board

The B16 breaker that Estates told me they would upgrade to C16 last year tripped from the sudden desire of a large quantity of batteries to charge up. Turning the power back on caused a happy bright flash, some dimmed lights, and then the breaker re-tripping.

Luckily, switching off the compressors and the distribution amplifier for the audio path was sufficient: the UPS began to charge at 05:44, and we enabled server power at 05:49.

Problem #2: logger2

As is traditional, logger2 decided not to play ball when we powered it up. It made strange noises and smelled a bit, so we left it off overnight. Confirmation that logger1 and uryblue were operating left us happy to power the transmitter on by 06:30. logger2 did start up later in the day, so it was probably just being fussy because it was a bit cold.

Problem #3: IT Services Systems

Other servers experienced delays coming up due to their reliance on the campus DNS and NTP servers, but worked as expected once running. Core IT Services Systems came up automatically shortly after power was restored.

Problem #4: Music Store Mounting

Partly Lloyd’s fault from playing with drbd way back when: /etc/fstab refused to agree that /dev/sdb1 is an ext4 filesystem, so it would not mount. Mounting it manually as ext4 worked, but this is a manual step that needs removing.
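
For the record, the fix looked roughly like the following. This is a sketch from memory rather than the exact commands run on the night, and the /music mountpoint and the fstab options shown are assumptions about how uryfs1 is set up.

    # Mounting via fstab fails because the recorded filesystem type is wrong:
    mount /music                        # fails: wrong fs type / bad superblock
    # Forcing the type works:
    mount -t ext4 /dev/sdb1 /music

    # The permanent fix is an explicit, correct entry in /etc/fstab, e.g.:
    # /dev/sdb1   /music   ext4   defaults   0   2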

At this stage, we verified that essential services - the digital audio path, member web services, loggers and BAPS - were all working sufficiently, and declared our state of emergency at an end.

The only remaining immediate issue was our Outside Broadcast scheduled to start at 10am. A contingency plan was devised and communicated in case IT Services had not restored the NAS by then. Unfortunately, the on-site engineers for the OB assumed the wired network was still unavailable because the OB machines did not detect a network connection, when, in fact, they had used the wrong network socket.

Event End: 07:11

Other Points of Note

After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service:

  • When the power failed in the IT Services Berrick Saul Datacenter, our Virtual Machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, because it had no network access and an NFS mount in its fstab, the server was found sitting in Single-User Mode waiting for administrator intervention when it was next checked on Sunday evening (a hedged fstab sketch for avoiding this follows this list).
  • The Stores UPS (the main backup supply for our servers) has now successfully acknowledged that it has 6 minutes of runtime, not 3-4 (of course, we know from the experience of this event that it is easily 15).
  • When the scheduled backup processes ran in the early hours of Monday morning, the backup system failed to detect that the network mountpoints were not set up on uryfs1. The system started to back up to its own root filesystem until all space was consumed, at which point services started to fail overnight. Luckily, Icecast was not an affected service, and successfully fell back to a static jukebox feed when it detected that Liquidsoap had stopped responding. This issue was not noticed until mid-afternoon on Monday when a technical member entered the station to see the Studio Clock reporting that the server was out of space (it turns out the studio clock does that now).
  • The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until attempts to resolve the uryfs1 backup issue were made. Our primary archive filestore was therefore unavailable to members for most of the day.
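
On the uryrrod point above: one way to stop a missing NFS mount from holding the whole boot hostage is to soften the fstab entry so a failed mount is skipped rather than fatal. The export path and mountpoint below are placeholders for illustration, not the real uryrrod configuration, and the exact behaviour of ‘nofail’ depends on the init system in use:

    # /etc/fstab on uryrrod (hypothetical export and mountpoint)
    # 'nofail' lets the boot continue if the mount cannot be made;
    # 'bg' retries the NFS mount in the background once the network is up.
    nfs-server.example:/export/share   /mnt/nfs   nfs   nofail,bg   0   0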

Changes Made

  • The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully as they have AT power supplies. The next cable tidy-up in that rack should remove the excess mains cable.
  • uryfs1 will now detect missing backup mounts and attempt to restore them.
  • Removed the need to enter console passwords on boot for ury and uryfs1
    • Decrypted the SSL certificates and reviewed permissions on their storage directory

Changes to be Made

  • Investigate Stores Distribution Board upgrade
    • Re-raise the issue with Estates and YUSU
  • Prevent uryfs1 and other servers from backing up if the mounts fail to come up (a rough sketch follows this list)
    • Anyone want to practice their bash-fu?
  • Ensure urybackup0 mounts /pool0 and /pool1 on boot
    • Currently need to run zfs mount pool0 && zfs mount pool1
  • Ensure uryfs1 mounts /music on boot
    • Server seems to get its filesystem types confused
  • Remember to power uryfw0 up first
    • Much spinning waiting on NTP etc otherwise (partly not our fault)
  • Add checking backup mounts and uryrrod to our standard boot procedure, and generally update the document
  • Review policy for communication on social media during a failure
    • Should named Technical persons have access for these scenarios?
  • Review policy for calling named persons at 2am (this kind of thing normally happens during the day)
    • Would Al have rather been woken up than not know?
  • Improve documentation on Outside Broadcast systems and standard contingency plans if 802.1x is unavailable on the campus network
    • Known non-NAS/opfa ports, IP addresses available to us
    • Give engineering the learnings of ‘how to tell what a port is’?
    • We should generally have a how-to guide for the freshers
  • Review whether the stores distribution board should be touched (I mean, look at it)
    • Talk to Estates/Health and Safety? Last time I did I got a “What the f***” from the electrician
  • Investigate automatic shutdown options?
    • We now know the switch/jukebox UPS lasts a good length of time, so the very short comms runtime (~90s) that was the limiting factor the last time this was considered is no longer a problem.
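
As teased above, a possible shape for the check that stops a backup when its mounts are missing is sketched below. The mountpoint list and the final backup command are placeholders, not the real uryfs1 configuration:

    #!/bin/bash
    # Abort the backup run if any required network mountpoint is missing.
    set -eu

    # Placeholder list - substitute the real uryfs1 backup mountpoints.
    REQUIRED_MOUNTS="/mnt/backup /music"

    for mnt in $REQUIRED_MOUNTS; do
        if ! mountpoint -q "$mnt"; then
            echo "$(date): $mnt is not mounted - aborting backup" >&2
            exit 1
        fi
    done

    # ...run the real backup job (rsync or similar) here, now that the
    # targets are known to be mounted...

For the urybackup0 side of the same problem, running zfs get canmount,mountpoint pool0 pool1 should show why the pools are being skipped at boot; zfs mount -a remains the manual catch-all in the meantime.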

Hopefully this’ll be of some use in the future.

Lloyd & Anthony

Were-there-when-it-happened Officers