Changes

Undo revision 242 by 7746 (talk)
Line 1: Line 1:  
'''''The email below is a post-mortem of a Campus North power outage at 01:42 on Sunday, 1 December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.'''''
 
'''''The email below is a post-mortem of a Campus North power outage at 01:42 on Sunday, 1 December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.'''''
   −
== Introduction ==
+
{{Incident
 +
  |brief=A displeased 12kV power cable shorted, taking out parts of the local power grid. URY's servers were unhappy with this series of events, so massaging was needed.
 +
  |severity=Critical
 +
  |impact=Medium (main outage at night, residual issues were for minor services)
 +
  |start=01/12/2013 01:42
 +
  |end=01/12/2013 07:11
 +
  |mitigation=Steps taken, but further action needed.
 +
  |leader=[[Anthony Williams]] <anthony@ury.org.uk>
 +
  |others=[[Lloyd Wallis]] <lpw@ury.org.uk>
 +
}}
 +
 
 
Hey everyone,
 
Hey everyone,
   Line 47: Line 57:  
== Other Points of Note ==
 
== Other Points of Note ==
 
After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service:
 
After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service:
When the power failed in the IT Services Berrick Saul Datacenter, our Virtual Machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, due to it having no network access and an NFS mount in its fstab, the server was found to be sitting in Single-User Mode waiting for administrator intervention when it was next checked on Sunday evening.
+
* When the power failed in the IT Services Berrick Saul Datacenter, our Virtual Machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, due to it having no network access and an NFS mount in its fstab, the server was found to be sitting in Single-User Mode waiting for administrator intervention when it was next checked on Sunday evening.
The Stores UPS (the main backup supply for our servers), has now successfully acknowledged that it has 6 minutes of runtime, not 3-4 (of course, we know from the experience of this event it is easily 15).
+
* The Stores UPS (the main backup supply for our servers) has now successfully acknowledged that it has 6 minutes of runtime, not 3-4 (of course, we know from the experience of this event that it is easily 15).
When the scheduled backup processes ran in the early hours of Monday morning, the backup system failed to detect that the network mountpoints were not set up on uryfs1. The system started to back up to its own root filesystem until all space was consumed, at which point services started to fail overnight. Luckily, Icecast was not an affected service, and successfully fell back to a static jukebox feed when it detected that Liquidsoap had stopped responding. This issue was not noticed until mid-afternoon on Monday when a technical member entered the station to see the Studio Clock reporting that the server was out of space (it turns out the studio clock does that now).
+
* When the scheduled backup processes ran in the early hours of Monday morning, the backup system failed to detect that the network mountpoints were not set up on uryfs1. The system started to back up to its own root filesystem until all space was consumed, at which point services started to fail overnight. Luckily, Icecast was not an affected service, and successfully fell back to a static jukebox feed when it detected that Liquidsoap had stopped responding. This issue was not noticed until mid-afternoon on Monday when a technical member entered the station to see the Studio Clock reporting that the server was out of space (it turns out the studio clock does that now).
The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until attempts to resolve the uryfs1 backup issue were made. Our primary archive filestore was therefore unavailable to members for most of the day.
+
* The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until attempts to resolve the uryfs1 backup issue were made. Our primary archive filestore was therefore unavailable to members for most of the day.
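
The uryfs1 backup failure above comes down to a missing pre-flight check. As a minimal sketch only (this is not the backup script URY actually runs), a wrapper could refuse to start unless every expected network mountpoint is really mounted, so a missing mount fails loudly instead of filling the root filesystem. The function names and the example path are hypothetical; <code>os.path.ismount</code> is the only library call relied on.

```python
import os
import sys


def missing_mounts(required, ismount=os.path.ismount):
    """Return the paths in `required` that are not currently mount points.

    `ismount` is injectable so the check can be exercised without real mounts.
    """
    return [path for path in required if not ismount(path)]


def preflight(required):
    """Return 0 if all required mountpoints are mounted, 1 otherwise.

    A real wrapper would only start the backup when this returns 0.
    """
    missing = missing_mounts(required)
    if missing:
        print("Refusing to back up; not mounted: " + ", ".join(missing),
              file=sys.stderr)
        return 1
    return 0


# Hypothetical usage (the path is an assumption, not URY's real layout):
#   status = preflight(["/mnt/backup-target"])
```

A non-zero return from <code>preflight</code> could also feed whatever alerting is in place, rather than waiting for someone to walk past the Studio Clock.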
    
== Changes Made ==
 
== Changes Made ==
Line 85: Line 95:     
Hopefully this’ll be of some use in the future.
 
Hopefully this’ll be of some use in the future.
 +
 
Lloyd & Anthony
 
Lloyd & Anthony
 +
 
Were-there-when-it-happened Officers
 
Were-there-when-it-happened Officers
    
[[Category:Incident Reports]]
 
[[Category:Incident Reports]]