'''''The email below is a post-mortem of a Campus North power outage at 01:42 on Sunday, 1st December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.'''''
 
== Introduction ==
 
Hey everyone,
 
'''tl;dr: There was a power cut. We turned everything off and back on again. It’s fine now.'''
    
As promised, this is a full breakdown of how the power outage on Sunday affected our systems and how we responded. You know, so we can learn from it and do better next time.
 
This outage came as a surprise, but a well overdue one. Usually, we expect to experience one about once a term, yet we went the entirety of the last academic year without any environment issues. Naturally, this one started shortly after a conversation about how we hadn’t had a power cut for some time.
 
== Event Start: 01:42 ==
 
   
At first, we didn’t think there was a substantial issue - the power in James College was knocked out for about a second, but their UPS coped so we assumed ours would too. Some fun was poked at YSTV for neglecting to put their web/db server on said UPS. Once we confirmed our website was inaccessible, we made our way over to investigate, but assumed v1rs0 was just rebooting.
 
We reached URY to find it running on UPS power, so we started a graceful shutdown of all servers using ACPI (i.e. the power button). We left logger2 running at this stage because of the uptime, and logger2 is a hardy-ish device that’s seen a lot anyway. Everything but that was powered down by 01:52.
 
== UPS Failure: 02:13 ==
 
At this point, logger2 had exhausted the UPS and powered off. We also powered off the transmitter, as we would not be logging when power was restored. Security confirmed that all emergency staff had been called in, but the interesting arrangement of areas without power meant it would likely be a while.
 
We spent the time wandering around campus in the dark and drinking hot chocolate in YSTV, living off the few drips of Internet Lloyd’s 3G tethering could provide and the rare view of stars due to the reduced light pollution.
 
== Power Restored: 05:42 ==
 
   
We notice a flicker from YSTV’s lights. We have a wander over and see power is back. We also learn the method of ‘looking out of YSTV’s window to see if the physics bridge is lit’ doesn’t work well. We rejoice.
 
== Problem #1: Stores Distribution Board ==
 
The B16 breaker that Estates told me they would upgrade to C16 last year tripped from the sudden desire of a large quantity of batteries to charge up. Turning the power back on caused a happy bright flash, some dimmed lights, then the breaker re-tripping.
 
Luckily, switching off the compressors and distribution amplifier for the audio path was sufficient: the UPS began to charge at 05:44 and enabled server power at 05:49.
 
== Problem #2: logger2 ==
 
As is traditional, logger2 decided not to play ball when we powered it up. It made strange noises and smelled a bit, so we left it off overnight. Confirmation that logger1 and uryblue were operating left us happy to power the transmitter on by 06:30. logger2 did start up later in the day, so it was probably just being fussy because it was a bit cold.
 
== Problem #3: IT Services Systems ==
 
Other servers experienced delays coming up due to reliance on the campus DNS and NTP servers, but once running worked as expected. Core IT Services Systems came up automatically shortly after power was restored.
 
== Problem #4: Music Store Mounting ==
 
Partly thanks to Lloyd playing with drbd way back when, /etc/fstab refused to agree that /dev/sdb1 is an ext4 filesystem, so it would not mount. Mounting it manually as ext4 worked. This is a manual step that needs removing, however.
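
One way to catch this sort of disagreement before the next reboot is a small check that compares the type declared in fstab with the type actually on the disk. The following is only a sketch under assumptions: the function names are made up for illustration, the sample fstab (including the /music mount point and the declared ext3 type) is hypothetical, and the "actual" types are passed in by hand rather than probed from the device.

```python
def parse_fstab(text):
    """Return {device: (mount_point, fstype)} for each non-comment fstab line."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # malformed line; skip rather than guess
        device, mount_point, fstype = fields[:3]
        entries[device] = (mount_point, fstype)
    return entries


def fstype_mismatches(fstab_text, actual_types):
    """List (device, declared, actual) where fstab disagrees with the real type."""
    mismatches = []
    for device, (_mount_point, declared) in parse_fstab(fstab_text).items():
        actual = actual_types.get(device)
        # "auto" lets mount detect the type itself, so it never counts as wrong.
        if actual is not None and declared not in ("auto", actual):
            mismatches.append((device, declared, actual))
    return mismatches


# Hypothetical sample: the mount point and the stale declared type are invented.
SAMPLE_FSTAB = """
# <device>  <mount point>  <type>  <options>  <dump>  <pass>
/dev/sda1   /              ext4    defaults   0      1
/dev/sdb1   /music         ext3    defaults   0      2
"""

# The real types would have to come from inspecting the devices; hard-coded here.
ACTUAL_TYPES = {"/dev/sda1": "ext4", "/dev/sdb1": "ext4"}

if __name__ == "__main__":
    for device, declared, actual in fstype_mismatches(SAMPLE_FSTAB, ACTUAL_TYPES):
        print(f"{device}: fstab says {declared}, disk says {actual}")
```

Run periodically, something like this would flag the stale entry while someone is awake to fix it, rather than at boot after a power cut.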
 
The only remaining immediate issue at hand was our Outside Broadcast scheduled to start at 10am. A contingency plan was devised and communicated in the event that IT Services had not restored NAS by this time. Unfortunately, the on-site engineers for the OB assumed that the wired network was still unavailable because the OB machines did not detect a network connection, when, in fact, they had used the wrong network socket.
 
'''''Event End: 07:11'''''
== Other Points of Note ==
 
After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service:
 
When the power failed in the IT Services Berrick Saul Datacenter, our Virtual Machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, because it had no network access and has an NFS mount in its fstab, the server was found sitting in Single-User Mode waiting for administrator intervention when it was next checked on Sunday evening.
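
As a rough illustration of how this could be spotted in advance, the sketch below scans fstab text for network filesystems that are not marked with the `nofail` mount option. It assumes a Linux-style fstab and a system whose mount tooling honours `nofail`, which may not match uryrrod; the function name, the set of network filesystem types and the sample entries are all hypothetical.

```python
# Filesystem types treated as "needs the network"; an assumption, not a complete list.
NETWORK_FSTYPES = {"nfs", "nfs4", "cifs"}


def risky_network_mounts(fstab_text):
    """Return mount points of network filesystems that lack the nofail option."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a full fstab entry; ignore it
        _device, mount_point, fstype, options = fields[:4]
        if fstype in NETWORK_FSTYPES and "nofail" not in options.split(","):
            risky.append(mount_point)
    return risky


# Hypothetical entries, purely for illustration.
SAMPLE_FSTAB = """
/dev/sda1              /             ext4  defaults         0 1
fileserver:/export/a   /mnt/archive  nfs   defaults         0 0
fileserver:/export/b   /mnt/scratch  nfs   defaults,nofail  0 0
"""

if __name__ == "__main__":
    for mount_point in risky_network_mounts(SAMPLE_FSTAB):
        print(f"{mount_point} could block boot if the network is down")
```

Whether `nofail` is the right fix depends on whether the machine is any use without the mount; the check only points out where the decision needs making.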
 
The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until attempts to resolve the uryfs1 backup issue were made. Our primary archive filestore was therefore unavailable to members for most of the day.
 
== Changes Made ==
 
The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully as they have AT power supplies. The next cable tidy-up in that rack should remove the excess mains cable.
 
uryfs1 will now detect missing backup mounts and attempt to restore them.
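
The actual mechanism on uryfs1 is not described here, so the following is only a minimal sketch of the general idea: check each expected mount point and, if it is missing, ask `mount` to bring it up from its fstab entry. The function names and the example paths are hypothetical, and the check and remount steps are passed in as parameters so the logic can be tried without touching real mounts.

```python
import os
import subprocess


def remount(path):
    """Ask mount(8) to mount the fstab entry for `path`; True on success."""
    return subprocess.run(["mount", path]).returncode == 0


def restore_missing_mounts(paths, is_mounted=os.path.ismount, remount=remount):
    """Try to remount anything not mounted; return the paths still missing."""
    still_missing = []
    for path in paths:
        if is_mounted(path):
            continue
        # Re-check after the attempt: a zero exit code alone is not trusted.
        if not remount(path) or not is_mounted(path):
            still_missing.append(path)
    return still_missing


if __name__ == "__main__":
    # Hypothetical mount points; the real ones on uryfs1 are not given.
    missing = restore_missing_mounts(["/mnt/backup0", "/mnt/backup1"])
    if missing:
        print("Still not mounted:", ", ".join(missing))
```

Anything left in the returned list is a candidate for an alert, since it means a human needs to look.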
 
Decrypt the SSL certificates and review permissions on their storage directory
 
== Changes to be Made ==
* Investigate Stores Distribution Board upgrade
 
* Re-raise the issue with Estates and YUSU
* Prevent uryfs1 and other servers from backing up if the mount fails to be brought up
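
For the last item, the guard could be as simple as refusing to start a backup unless the target really is a mount point, so a failed mount never results in a backup being written to the local disk underneath. This is a sketch under assumptions: the exception class, the function names and the example path are made up for illustration, and `do_backup` stands in for whatever the real backup job is.

```python
import os


class BackupTargetNotMounted(RuntimeError):
    """Raised when the backup destination is not an active mount point."""


def run_backup_if_mounted(target, do_backup, is_mounted=os.path.ismount):
    """Run do_backup(target) only if target is mounted; otherwise refuse loudly."""
    if not is_mounted(target):
        raise BackupTargetNotMounted(
            f"{target} is not a mount point; refusing to back up"
        )
    return do_backup(target)


if __name__ == "__main__":
    # Hypothetical destination and a stand-in backup job.
    try:
        run_backup_if_mounted("/mnt/backup0", lambda t: print("backing up to", t))
    except BackupTargetNotMounted as err:
        print("Backup skipped:", err)
```

Raising an error rather than silently skipping means the failure shows up in whatever is watching the backup job.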