Incident Report: 20131217

From URY Wiki
Latest revision as of 09:48, 20 January 2014

Incident Report
Some planned IT Services maintenance that went as planned had an unplanned impact on one of our servers.
Summary
Severity: Low
Impact: None (Only Production in term-time)
Event Start: 17/12/2013 07:30 (est)
Event End: 19/12/2013 11:45
Recurrence Mitigation: Some action taken, further action to be considered.
Contacts
Recovery Leader: Lloyd Wallis <lpw@ury.org.uk>
Other Attendees: Gavin Atkinson <gavin@ury.org.uk>


On Tuesday, 17th December 2013, IT Services carried out some long-planned work on several of the filestores that back various campus services, including RentedFS and FlexFS, which back parts of their virtualisation infrastructure. Our understanding was that the maintenance would move the systems from the legacy csrv.york.ac.uk domain to its.york.ac.uk.

However, unbeknownst to us, the filestores were also upgraded during this period, which briefly took them offline. As a result, I/O requests from VMs running in the IT Services Cloud were queued for some time. Our virtual server, uryrrod.york.ac.uk, failed to recover when its virtual hard disk was reconnected: the queued I/O had already timed out, so the machine fell over.
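How long a guest waits on queued I/O before giving up is usually a per-disk setting inside the guest. The sketch below shows one way to inspect it, assuming (this is not stated above) that the guest runs Linux and that its virtual disks are presented as SCSI devices, in which case the kernel exposes the command timeout in seconds under /sys/block/<device>/device/timeout. The function names and the 180-second threshold are illustrative choices, not values taken from IT Services.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_in_seconds} for block devices that
    expose a SCSI command timeout in sysfs (assumes a Linux guest)."""
    root = Path(sys_block)
    if not root.is_dir():
        # Not Linux, or sysfs is not mounted: nothing to report.
        return {}
    timeouts = {}
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        if timeout_file.is_file():
            timeouts[dev.name] = int(timeout_file.read_text().strip())
    return timeouts


def devices_below(timeouts, minimum=180):
    """List devices whose timeout is shorter than `minimum` seconds.
    The default of 180 is a hypothetical threshold for illustration."""
    return [name for name, secs in timeouts.items() if secs < minimum]


if __name__ == "__main__":
    found = read_disk_timeouts()
    for name, secs in found.items():
        print(f"{name}: {secs}s")
    short = devices_below(found)
    if short:
        print("Shorter than threshold:", ", ".join(short))
```

Reading the value needs no special privileges; changing it (by writing a new number to the same file as root) does not persist across reboots unless it is also set by a boot-time mechanism such as a udev rule.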

This was discovered on the Thursday morning, two days after the incident. The server was rebooted, but it failed to come back up cleanly - we had ourselves a corrupted journal! Several minutes of manual fsck'ing later, the system came back up cleanly and mixclouder operation was restored.

Proposed Actions

  • IT Services increase the I/O timeout on their VMs for this reason. We should investigate doing the same.
  • Xymon does not currently have any monitoring enabled for uryrrod. It should monitor at least system availability.
  • Make further use of the snapshot capabilities of VMware. Once the server was restored, a snapshot was taken to enable easy restoration in the event of a recurrence, but this should possibly become a more regular process.
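As an illustration of what a basic availability check involves, here is a minimal sketch of a TCP reachability probe. It is not Xymon configuration and does not replace it; it only shows the kind of test that would have flagged the outage on the Tuesday rather than the Thursday. The choice of port 22 (SSH) is an assumption, as the services uryrrod exposes are not listed above.

```python
import socket


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False on refusal, timeout or lookup failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "uryrrod.york.ac.uk"
    port = 22  # assumption: the server accepts SSH connections
    state = "up" if is_reachable(host, port) else "DOWN"
    print(f"{host}:{port} is {state}")
```

A check like this only proves that the network stack is answering. A machine whose disk has gone away can sometimes still accept connections, so a check that exercises the filesystem would be a useful complement.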