'''''The below email was a post-mortem of a Campus North power outage at 01:42am, Sunday, 1st December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.'''''

== Introduction ==
Hey everyone,

== Changes Made ==
* The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully as they have AT power supplies. The next cable tidyup in that rack should remove the excess mains cable.
* uryfs1 will now detect missing backup mounts and attempt to restore them.
* Remove the need to enter console passwords on boot for ury and uryfs1
** Decrypt the SSL certificates and review permissions on their storage directory

== Changes to be Made ==
* Investigate Stores Distribution Board upgrade
** Re-raise the issue with Estates and YUSU
* Prevent uryfs1 and other servers from backing up if the mount fails to be brought up
** Anyone want to practice their bash-fu?
* Ensure urybackup0 mounts /pool0 and /pool1 on boot
** Currently need to run zfs mount pool0 && zfs mount pool1
* Ensure uryfs1 mounts /music on boot
** Server seems to get its filesystem types confused
* Remember to power uryfw0 up first
** Much spinning waiting on NTP etc otherwise (partly not our fault)
* Add checking backup mounts, uryrrod to our standard boot procedure, plus generally update the document
** https://docs.google.com/document/d/12gdrkNWPqC0hc0sJ1ETqM9TXCwmQy3ZBma1anc8ZO7M/edit
* Review policy for communication on social media during a failure
** Should named Technical persons have access for these scenarios?
* Review policy for calling named persons at 2am (this kind of thing normally happens during the day)
** Would Al have rather been woken up than not know?
* Improve documentation on Outside Broadcast systems and standard contingency plans if 802.1x is unavailable on the campus network
** Known non-NAS/opfa ports, IP addresses available to us
** Give engineering the learnings of ‘how to tell what a port is’?
** We should generally have a how-to guide for the freshers
* Review whether the stores distribution board should be touched (I mean, look at it)
** Talk to Estates/Health and Safety? Last time I did I got a “What the f***” from the electrician
* Investigate automatic shutdown options?
** We now know the switch/jukebox UPS lasts a good length of time so the limiting factor of very short comms runtime (~90s) last time this was considered is no longer a problem.

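For the "prevent servers from backing up if the mount fails" item above, a minimal bash sketch of such a guard might look like the following. This is not the script running on uryfs1. The function names, the MOUNT_CHECK and MOUNT_CMD hooks and any example paths are assumptions made purely for illustration. It also assumes mountpoint(1) from util-linux is installed and that the backup target has an /etc/fstab entry, so that calling mount with only the target is enough.

```shell
#!/usr/bin/env bash
# Hypothetical guard: only run a backup when the target really is a mount point.
# MOUNT_CHECK and MOUNT_CMD are illustrative hooks so the logic can be tried
# without root; by default they fall back to mountpoint(1) and mount(8).

MOUNT_CHECK="${MOUNT_CHECK:-mountpoint -q}"
MOUNT_CMD="${MOUNT_CMD:-mount}"

# Return 0 if $1 is mounted, trying once to mount it if it is not.
ensure_mounted() {
    local target="$1"
    # Already mounted: nothing to do.
    $MOUNT_CHECK "$target" && return 0
    echo "warning: $target is not mounted, attempting to mount" >&2
    # Relies on an fstab entry for the target (an assumption of this sketch).
    $MOUNT_CMD "$target" || return 1
    # Re-check: a mount command exiting 0 is not proof the mount is there.
    $MOUNT_CHECK "$target"
}

# Run the given backup command only if the target is (or can be) mounted.
guarded_backup() {
    local target="$1"
    shift
    if ensure_mounted "$target"; then
        "$@"
    else
        echo "error: $target unavailable, skipping backup" >&2
        return 1
    fi
}
```

Usage could be something like <code>guarded_backup /mnt/backup rsync -a /music/ /mnt/backup/music/</code>, where /mnt/backup is a made-up mount point. The point of the second check is that the backup never writes into an empty directory on the local disk just because the mount command claimed to succeed.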
Hopefully this’ll be of some use in the future.

Lloyd & Anthony

Were-there-when-it-happened Officers

[[Category:Incident Reports]]