Incident Report: 20140202
| Incident Report | |
|---|---|
| A segmentation fault on our web server caused cascading failures on all URY Computing Services | |
| Summary | |
| Severity | Critical | 
| Impact | High (Complete failure of many URY Computing Services for 37+ minutes) | 
| Event Start | 02/02/2014 23:00 | 
| Event End | 02/02/2014 23:49 | 
| Recurrence Mitigation | Multiple actions required to prevent recurrence. | 
| Contacts | |
| Recovery Leader | Anthony Williams <lpw@ury.org.uk> | 
| Other Attendees | Andrew Durant <aj@ury.org.uk> | 
An at-the-time regular failure of the PHP5 APC package caused many of URY's Computing Services to be unavailable for a period of time on Sunday 2nd February.
Due to a system issue under active investigation, PHP's APC module, which underpins the MyRadio caching system, regularly fails with a segmentation fault, leaving some PHP requests unserviceable. This issue breaks member-facing services, not public-facing ones, and thanks to monitoring systems, service is usually restored within 5 minutes of failing. Investigation currently involves enabling increasing levels of debug compile options to locate the root cause of the segmentation fault. At the time it was also unclear whether this was related to another issue, in which some Apache modules would fail after a log rotation because they did not update their file pointers correctly.
From approximately 18:25 until 18:35 on 2nd February, the University of York's connection to Janet briefly failed for reasons currently unknown (src: "gavinatkinson doesn't actually know what happened with the offsite glitch"). Several monitoring IRC robots that often provide useful information, including xymon-bsod, our service monitoring bot, dropped off Freenode and failed to reconnect once access was restored.
Because xymon-bsod was offline, none of us were aware of the failure of myradio_daemon, one of our backend services, or of the increasing load averages on our web server. Had it been online, it would have notified us when the initial segmentation fault appeared at 23:00 and as load gradually increased.
At 23:12, Pyramid, the framework that our website is based on, started to report failures with some of its backend requests which are based on the MyRadio API or certain parts of the database. At this point, xymon-bsod would have likely picked up HTTP response alerts too.
At 23:27, Pyramid started timing out completely when processing new requests. At this point, Apache's WSGI handlers also started queueing up while waiting for responses from Pyramid, retrying several times before themselves timing out.
At 23:28, the first user report was received, via a private Facebook message to Lloyd Wallis. Lloyd was unavailable at the time and the message went unnoticed. Five more users reported the failure in this way over the next few minutes.
At 23:33, our PostgreSQL database reported that it had reached the maximum number of allowed active connections and stopped serving new requests. At this time, services such as our Jukebox Scheduler, Tracklisting and BAPS all stopped working, significantly hampering broadcast capabilities.
At 23:34, the scale of the outage finally led to phone calls and other notifications drawing the attention of the actual URY Computing Team to the issue. At this point the load on our web server was over 90. The database was diagnosed as the cause of many of the issues and a restart was requested at 23:38, but it was delayed while waiting for the idle connections to terminate. The restart of this service was completed at 23:42 and the services required for broadcast recovered.
Apache was then stopped at 23:48 to allow the web server to calm down and recover to a regular load average. It was restarted at 23:49, at which point all URY Computing Services were once again operating normally.
Causes
- The segmentation fault that causes MyRadio to fail is still under investigation to identify the root cause.
- The stage of processing that MyRadio appears to be in at the time of the segfault means that a database connection has been opened but does not get cleanly closed due to the crash. This left idle connections on the database, which over time caused other systems to fail.
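
The idle-connection build-up described above could, in principle, be spotted before the connection limit is reached. The following is a minimal sketch only, not something URY ran: it assumes rows shaped like the `pid`, `state` and `state_change` columns of `pg_stat_activity` (PostgreSQL 9.2 or later), and the function name `find_stale_connections` and the 5 minute threshold are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

# Query an operator might run to obtain the rows (PostgreSQL 9.2+ column names).
# It is shown for context only; this sketch never connects to a database.
IDLE_QUERY = "SELECT pid, state, state_change FROM pg_stat_activity"


def find_stale_connections(rows, now, max_idle=timedelta(minutes=5)):
    """Return the pids of connections that have sat idle longer than max_idle.

    rows is an iterable of (pid, state, state_change) tuples, where
    state_change is the time the connection last changed state.
    """
    stale = []
    for pid, state, state_change in rows:
        if state != "idle" or state_change is None:
            continue  # active or unknown connections are left alone
        if now - state_change > max_idle:
            stale.append(pid)
    return stale


if __name__ == "__main__":
    # Made-up sample data, loosely modelled on the timeline of this incident.
    now = datetime(2014, 2, 2, 23, 33)
    sample_rows = [
        (101, "idle", datetime(2014, 2, 2, 23, 0)),     # idle since the first segfault
        (102, "active", datetime(2014, 2, 2, 23, 32)),  # currently serving a query
        (103, "idle", datetime(2014, 2, 2, 23, 31)),    # only briefly idle
    ]
    print(find_stale_connections(sample_rows, now))
```

A monitoring job could alert on the returned pids, or pass each one to PostgreSQL's `pg_terminate_backend(pid)`. Whether automatic termination is safe for long-lived clients such as BAPS would need checking first.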
Work Required
- A review of the MyRadio code is required to ensure that any steps possible are taken to cleanly terminate database connections after a failure.
- A review of PostgreSQL is required to see if it can have reduced idle timeouts or better handling of broken connections.
- The root cause of the APC Segmentation Faults needs to be discovered and rectified, replacing APC with another solution if necessary.
- Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting.
- A behaviour change of URY members is required to ensure that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems.
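
For the automated reconnection of IRC bots mentioned in the list above, one common approach is to retry the connection with an exponentially increasing delay. This is a hedged sketch, not the actual xymon-bsod code: `connect_with_retry`, `backoff_delays` and their parameters are hypothetical names, and `connect` stands in for whatever function the bot uses to open its connection, assumed here to raise `OSError` on network failure.

```python
import time


def backoff_delays(base=1.0, cap=300.0):
    """Yield delays that double each time, up to a maximum of cap seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def connect_with_retry(connect, max_attempts=None, sleep=time.sleep,
                       base=1.0, cap=300.0):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect is any callable that returns a connection or raises OSError.
    If max_attempts is None the loop retries forever, which suits a bot
    that should always come back after a network outage.
    """
    attempt = 0
    for delay in backoff_delays(base, cap):
        attempt += 1
        try:
            return connect()
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and let the caller report the failure
            sleep(delay)


if __name__ == "__main__":
    # Simulate a network that is down for the first two attempts.
    outcomes = iter([ConnectionError("down"), ConnectionError("down"), "connected"])

    def fake_connect():
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(connect_with_retry(fake_connect, sleep=lambda seconds: None))
```

Capping the delay keeps the bot from waiting excessively long after an extended outage, and an optional `max_attempts` gives a point at which a separate channel, such as email, could be used to report that the bot itself is down.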