'''''The email below is a post-mortem of a Campus North power outage at 01:42 on Sunday, 1 December 2013. Assistant Head of Computing Anthony Williams and Live-Next-To-The-Station-Man Lloyd Wallis were on site to respond. This is how it went.'''''
 
== Introduction ==
 
Hey everyone,
 
== Changes Made ==
 
* The monitor in the server cupboard has been moved from mains power to UPS power - this is needed to power off the loggers gracefully as they have AT power supplies. The next cable tidy-up in that rack should remove the excess mains cable.
* uryfs1 will now detect missing backup mounts and attempt to restore them.
* Remove the need to enter console passwords on boot for ury and uryfs1
** Decrypt the SSL certificates and review permissions on their storage directory
    
== Changes to be Made ==
 
* Investigate Stores Distribution Board upgrade
** Re-raise the issue with Estates and YUSU
* Prevent uryfs1 and other servers from backing up if the mount fails to be brought up
** Anyone want to practice their bash-fu?
* Ensure urybackup0 mounts /pool0 and /pool1 on boot
** Currently need to run zfs mount pool0 && zfs mount pool1
* Ensure uryfs1 mounts /music on boot
** Server seems to get its filesystem types confused
* Remember to power uryfw0 up first
** Much spinning waiting on NTP etc otherwise (partly not our fault)
* Add checking backup mounts, uryrrod to our standard boot procedure, plus generally update the document
** https://docs.google.com/document/d/12gdrkNWPqC0hc0sJ1ETqM9TXCwmQy3ZBma1anc8ZO7M/edit
* Review policy for communication on social media during a failure
** Should named Technical persons have access for these scenarios?
* Review policy for calling named persons at 2am (this kind of thing normally happens during the day)
** Would Al have rather been woken up than not know?
* Improve documentation on Outside Broadcast systems and standard contingency plans if 802.1x is unavailable on the campus network
** Known non-NAS/opfa ports, IP addresses available to us
** Give engineering the learnings of ‘how to tell what a port is’?
** We should generally have a how-to guide for the freshers
* Review whether the stores distribution board should be touched (I mean, look at it)
** Talk to Estates/Health and Safety? Last time I did I got a “What the f***” from the electrician
* Investigate automatic shutdown options?
** We now know the switch/jukebox UPS lasts a good length of time so the limiting factor of very short comms runtime (~90s) last time this was considered is no longer a problem.
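
For anyone picking up the "prevent backing up if the mount fails" item, here is a rough starting point rather than a finished fix. It is a minimal sketch that assumes the backup target is an ordinary mount point and that the <code>mountpoint</code> utility from util-linux is installed. The function names <code>is_mounted</code> and <code>guarded_backup</code> are made up for illustration and are not part of any existing URY script. It does not try to bring the mount up itself; it only refuses to run the backup when the mount is missing.

```shell
#!/bin/sh
# Sketch only: run a backup command only when its target is really mounted.
# Function names here are hypothetical, not taken from URY's scripts.

# Succeeds only if $1 is an active mount point (mountpoint is from util-linux).
is_mounted() {
    mountpoint -q "$1"
}

# Usage: guarded_backup MOUNT_DIR COMMAND [ARGS...]
# Runs COMMAND only if MOUNT_DIR is mounted; otherwise prints a warning
# to stderr and returns 1 without touching the directory.
guarded_backup() {
    mount_dir=$1
    shift
    if ! is_mounted "$mount_dir"; then
        echo "backup skipped: $mount_dir is not mounted" >&2
        return 1
    fi
    "$@"
}
```

A call might look like <code>guarded_backup /mnt/backup rsync -a /music/ /mnt/backup/music/</code>, where <code>/mnt/backup</code> is a placeholder path. The point of the guard is that a failed mount leaves an empty directory on the local disk, and without the check the backup would quietly fill that instead of the real backup volume.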
    
Hopefully this’ll be of some use in the future.
 
Lloyd & Anthony
 
Were-there-when-it-happened Officers
 
[[Category:Incident Reports]]
 