Line 1: |
Line 1: |
| '''''The below email was a post-mortem of a Campus North power outage at 01:42am, Sunday, 1st December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.''''' | | '''''The below email was a post-mortem of a Campus North power outage at 01:42am, Sunday, 1st December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.''''' |
| | | |
− | == Introduction == | + | {{Incident |
| + | |brief=A displeased 12kV power cable shorted, taking out parts of the local power grid. URY's servers were unhappy with this series of events, so massaging was needed. |
| + | |severity=Critical |
| + | |impact=Medium (main outage at night, residual issues were for minor services) |
| + | |start=01/12/2013 01:42 |
| + | |end=01/12/2013 07:11 |
| + | |mitigation=Steps taken, but further action needed. |
| + | |leader=[[Anthony Williams]] <anthony@ury.org.uk> |
| + | |others=[[Lloyd Wallis]] <lpw@ury.org.uk> |
| + | }} |
| + | |
| Hey everyone, | | Hey everyone, |
| | | |
Line 47: |
Line 57: |
| == Other Points of Note == | | == Other Points of Note == |
| After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service: | | After the initial restoration of services, the following days raised several issues that required manual intervention to fully restore service: |
− | When the power failed in the IT Services Berrick Saul Datacenter, our Virtual Machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, due to it having no network access and an NFS mount in its fstab, the server was found to be sitting in Single-User Mode waiting for administrator intervention when it was next checked on Sunday evening. | + | * When the power failed in the IT Services Berrick Saul Datacenter, our Virtual Machine, uryrrod, was automatically powered on in the TFTA Datacenter. However, due to it having no network access and an NFS mount in its fstab, the server was found to be sitting in Single-User Mode waiting for administrator intervention when it was next checked on Sunday evening. |
− | The Stores UPS (the main backup supply for our servers) has now successfully acknowledged that it has 6 minutes of runtime, not 3-4 (of course, we know from the experience of this event that it is easily 15). | + | * The Stores UPS (the main backup supply for our servers) has now successfully acknowledged that it has 6 minutes of runtime, not 3-4 (of course, we know from the experience of this event that it is easily 15). |
− | When the scheduled backup processes ran in the early hours of Monday morning, the backup system failed to detect that the network mountpoints were not set up on uryfs1. The system started to back up to its own root filesystem until all space was consumed, at which point services started to fail overnight. Luckily, Icecast was not an affected service, and successfully fell back to a static jukebox feed when it detected that Liquidsoap had stopped responding. This issue was not noticed until mid-afternoon on Monday when a technical member entered the station to see the Studio Clock reporting that the server was out of space (it turns out the studio clock does that now). | + | * When the scheduled backup processes ran in the early hours of Monday morning, the backup system failed to detect that the network mountpoints were not set up on uryfs1. The system started to back up to its own root filesystem until all space was consumed, at which point services started to fail overnight. Luckily, Icecast was not an affected service, and successfully fell back to a static jukebox feed when it detected that Liquidsoap had stopped responding. This issue was not noticed until mid-afternoon on Monday when a technical member entered the station to see the Studio Clock reporting that the server was out of space (it turns out the studio clock does that now). |
− | The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until attempts to resolve the uryfs1 backup issue were made. Our primary archive filestore was therefore unavailable to members for most of the day. | + | * The ZFS pools on urybackup0 were not automatically mounted on boot. This was also not noticed until attempts to resolve the uryfs1 backup issue were made. Our primary archive filestore was therefore unavailable to members for most of the day. |
| | | |
| == Changes Made == | | == Changes Made == |
− | The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully as they have AT power supplies. The next cable tidyup in that rack should remove the excess mains cable. | + | * The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully as they have AT power supplies. The next cable tidyup in that rack should remove the excess mains cable. |
− | uryfs1 will now detect missing backup mounts and attempt to restore them. | + | * uryfs1 will now detect missing backup mounts and attempt to restore them. |
− | Remove the need to enter console passwords on boot for ury and uryfs1 | + | * Remove the need to enter console passwords on boot for ury and uryfs1 |
− | Decrypt the SSL certificates and review permissions on their storage directory | + | ** Decrypt the SSL certificates and review permissions on their storage directory |
| | | |
| == Changes to be Made == | | == Changes to be Made == |
| * Investigate Stores Distribution Board upgrade | | * Investigate Stores Distribution Board upgrade |
− | Re-raise the issue with Estates and YUSU | + | ** Re-raise the issue with Estates and YUSU |
− | Prevent uryfs1 and other servers from backing up if the mount fails to be brought up | + | * Prevent uryfs1 and other servers from backing up if the mount fails to be brought up |
− | Anyone want to practice their bash-fu? | + | ** Anyone want to practice their bash-fu? |
− | Ensure urybackup0 mounts /pool0 and /pool1 on boot | + | * Ensure urybackup0 mounts /pool0 and /pool1 on boot |
− | Currently need to run zfs mount pool0 && zfs mount pool1 | + | ** Currently need to run zfs mount pool0 && zfs mount pool1 |
− | Ensure uryfs1 mounts /music on boot | + | * Ensure uryfs1 mounts /music on boot |
− | Server seems to get its filesystem types confused | + | ** Server seems to get its filesystem types confused |
− | Remember to power uryfw0 up first | + | * Remember to power uryfw0 up first |
− | Much spinning waiting on NTP etc otherwise (partly not our fault) | + | ** Much spinning waiting on NTP etc otherwise (partly not our fault) |
− | Add checks of the backup mounts and uryrrod to our standard boot procedure, and generally update the document | + | * Add checks of the backup mounts and uryrrod to our standard boot procedure, and generally update the document |
− | https://docs.google.com/document/d/12gdrkNWPqC0hc0sJ1ETqM9TXCwmQy3ZBma1anc8ZO7M/edit | + | ** https://docs.google.com/document/d/12gdrkNWPqC0hc0sJ1ETqM9TXCwmQy3ZBma1anc8ZO7M/edit |
− | Review policy for communication on social media during a failure | + | * Review policy for communication on social media during a failure |
− | Should named Technical persons have access for these scenarios? | + | ** Should named Technical persons have access for these scenarios? |
− | Review policy for calling named persons at 2am (this kind of thing normally happens during the day) | + | * Review policy for calling named persons at 2am (this kind of thing normally happens during the day) |
− | Would Al have rather been woken up than not know? | + | ** Would Al have rather been woken up than not know? |
− | Improve documentation on Outside Broadcast systems and standard contingency plans if 802.1x is unavailable on the campus network | + | * Improve documentation on Outside Broadcast systems and standard contingency plans if 802.1x is unavailable on the campus network |
− | Known non-NAS/opfa ports, IP addresses available to us | + | ** Known non-NAS/opfa ports, IP addresses available to us |
− | Give engineering the learnings of ‘how to tell what a port is’? | + | ** Give engineering the learnings of ‘how to tell what a port is’? |
− | We should generally have a how-to guide for the freshers | + | ** We should generally have a how-to guide for the freshers |
− | Review whether the stores distribution board should be touched (I mean, look at it) | + | * Review whether the stores distribution board should be touched (I mean, look at it) |
− | Talk to Estates/Health and Safety? Last time I did I got a “What the f***” from the electrician | + | ** Talk to Estates/Health and Safety? Last time I did I got a “What the f***” from the electrician |
− | Investigate automatic shutdown options? | + | * Investigate automatic shutdown options? |
− | We now know the switch/jukebox UPS lasts a good length of time, so the very short comms runtime (~90s) that was the limiting factor last time this was considered is no longer a problem. | + | ** We now know the switch/jukebox UPS lasts a good length of time, so the very short comms runtime (~90s) that was the limiting factor last time this was considered is no longer a problem. |
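
For the "prevent servers from backing up if the mount fails" item above, here is a minimal sketch of the sort of guard that could wrap a backup job. It is not the script running on uryfs1: the function name `run_if_mounted`, the path `/mnt/backup` and the `rsync` call in the example are all placeholders made up for illustration. The only real tool it relies on is `mountpoint` from util-linux, which exits 0 only when the given path is an actual mount point.

```shell
#!/bin/sh
# Hypothetical guard: run a backup command only if its target is a real mount.
# Usage: run_if_mounted <mount-point> <command> [args...]
run_if_mounted() {
    target="$1"
    shift
    # mountpoint(1) exits 0 only when the path is a mount point, so an
    # unmounted (empty) directory on the root filesystem is rejected here.
    if ! mountpoint -q "$target"; then
        echo "refusing to back up: $target is not mounted" >&2
        return 1
    fi
    # The target is mounted, so run the command that was passed in.
    "$@"
}

# Example call (path and command are placeholders, not URY's real ones):
# run_if_mounted /mnt/backup rsync -a /music/ /mnt/backup/music/
```

Failing loudly with a non-zero exit, rather than quietly writing into the empty directory, is what would have stopped the root filesystem filling up; a cron wrapper or monitoring check could then pick up the failure.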
| | | |
| Hopefully this’ll be of some use in the future. | | Hopefully this’ll be of some use in the future. |
| + | |
| Lloyd & Anthony | | Lloyd & Anthony |
| + | |
| Were-there-when-it-happened Officers | | Were-there-when-it-happened Officers |
| | | |
| [[Category:Incident Reports]] | | [[Category:Incident Reports]] |