Incident Report: 20131217

Incident Report
Some planned IT Services maintenance that went as planned had an unplanned impact on one of our servers.

Summary
Severity	Low
Impact	None (Only Production in term-time)
Event Start	17/12/2013 07:30 (estimated)
Event End	19/12/2013 11:45
Recurrence Mitigation	Some action taken, further action to be considered.

Contacts
Recovery Leader	Lloyd Wallis <lpw@ury.org.uk>
Other Attendees	Gavin Atkinson <gavin@ury.org.uk>

On Tuesday, 17th December 2013, IT Services carried out some long-planned work on several of the filestores that back various campus services, including RentedFS and FlexFS, which back parts of their virtualisation infrastructure. Our understanding of the maintenance was that the systems would be moved from the legacy csrv.york.ac.uk domain to its.york.ac.uk.

However, unbeknownst to us, the filestores were also upgraded during this period, briefly taking them offline. As a result, I/O requests from VMs running in the IT Services Cloud were queued for some time. Our virtual server, uryrrod.york.ac.uk, failed to recover when its virtual hard disk was reconnected: the queued I/O had already timed out, so the machine fell over.

This was discovered on the Thursday morning, two days after the incident. The server was rebooted, but it failed to come back up cleanly - we had ourselves a corrupted journal! Several minutes of manual fsck'ing later, the system came back up cleanly and mixclouder operation was restored.

Proposed Actions

IT Services increase the I/O timeout on their VMs for this reason. We should investigate doing the same.
Xymon does not currently have any monitoring enabled for uryrrod. It should monitor at least system availability.
Make further use of the snapshot capabilities of VMware. Once the server was restored, a snapshot was taken to enable easy restoration in the event of recurrence, but this should possibly become a more regular process.
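
As a starting point for the I/O timeout investigation, the sketch below reports the current per-disk timeout on a Linux guest. It assumes the disks are presented as SCSI devices, in which case the kernel exposes the value in seconds at /sys/block/<device>/device/timeout. The 180 second threshold is an illustrative assumption only, not the value IT Services use, and the function names are hypothetical.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for block devices that
    expose a SCSI command timeout under <sys_block>/<dev>/device/timeout."""
    timeouts = {}
    root = Path(sys_block)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Not a SCSI-style device (e.g. loop devices) or unreadable: skip it.
            continue
    return timeouts


def below_threshold(timeouts, minimum=180):
    """Return the devices whose timeout is lower than `minimum` seconds.
    The default of 180 is an assumed example value, not a recommendation."""
    return {dev: secs for dev, secs in timeouts.items() if secs < minimum}


if __name__ == "__main__":
    current = read_disk_timeouts()
    for dev, secs in current.items():
        print(f"{dev}: {secs}s")
    for dev, secs in below_threshold(current).items():
        print(f"WARNING: {dev} timeout of {secs}s is below the assumed threshold")
```

This only reads the values. Raising a timeout would mean writing to the same file as root, and making the change survive a reboot (for example with a udev rule), which should be agreed with IT Services first.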
