Incident Report: 20140202: Difference between revisions

Created page with "{{Incident |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services |severity=Critical |impact=High (Complete failure of URY..."
 
(One intermediate revision by the same user not shown)
Line 2: Line 2:
   |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services
   |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services
   |severity=Critical
   |severity=Critical
   |impact=High (Complete failure of URY Computing Services for 37+ minutes)
   |impact=High (Complete failure of many URY Computing Services for 37+ minutes)
   |start=02/02/2014 23:00
   |start=02/02/2014 23:00
   |end=02/02/2014 23:49
   |end=02/02/2014 23:49
   |mitigation=Multiple actions required to prevent recurrence.
   |mitigation=Multiple actions required to prevent recurrence.
   |leader=[[Anthony Williams]] <lpw@ury.org.uk>
   |leader=[[Anthony Williams]] <anthony@ury.org.uk>
   |others=[[Andrew Durant]] <aj@ury.org.uk>
   |others=[[Andrew Durant]] <aj@ury.org.uk>
}}
}}
Line 22: Line 22:
At 23:27, Pyramid began timing out on all new requests. Apache's WSGI handlers then started queueing while waiting for responses from Pyramid, retrying several times before timing out themselves.
At 23:27, Pyramid began timing out on all new requests. Apache's WSGI handlers then started queueing while waiting for responses from Pyramid, retrying several times before timing out themselves.


At 23:28, the first user report was received, via a private Facebook message to [[Lloyd Wallis]]. Lloyd was unavailable at the time and the message went unnoticed. Five more users reported the failure in this way over the next few minutes.
At 23:28, the first user report was received, via a private Facebook message to [[Lloyd Wallis]]. Lloyd was unavailable at the time and the message went unnoticed. Four more users reported the failure in this way over the next few minutes.


At 23:33, our PostgreSQL database reported that it had reached the maximum number of allowed active connections and stopped serving new requests. From this point, services such as our Jukebox Scheduler, Tracklisting and BAPS all stopped working, significantly hampering broadcast capabilities.
At 23:33, our PostgreSQL database reported that it had reached the maximum number of allowed active connections and stopped serving new requests. From this point, services such as our Jukebox Scheduler, Tracklisting and BAPS all stopped working, significantly hampering broadcast capabilities.
Line 33: Line 33:
* The root cause of the segmentation fault that crashes MyRadio is still under investigation.
* The root cause of the segmentation fault that crashes MyRadio is still under investigation.
* MyRadio appears to segfault at a stage of processing where a database connection has already been opened, so the crash prevents the connection from being closed cleanly. This left idle connections on the database, which over time caused other systems to fail.
* MyRadio appears to segfault at a stage of processing where a database connection has already been opened, so the crash prevents the connection from being closed cleanly. This left idle connections on the database, which over time caused other systems to fail.
* MyRadio currently runs as a database superuser.
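
The idle-connection finding above could be watched for with something like the following. This is only a minimal sketch, not part of the actual fix: it assumes PostgreSQL 9.2 or later (where <code>pg_stat_activity</code> exposes the <code>state</code> and <code>state_change</code> columns), and the <code>Backend</code> type, the function names and the five-minute threshold are all hypothetical choices made for illustration. The query is shown as a string only; the logic works on rows that have already been fetched.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Query a monitoring job might run (assumes PostgreSQL 9.2+ column names).
ACTIVITY_QUERY = """
SELECT pid, usename, state, state_change
FROM pg_stat_activity
"""


@dataclass(frozen=True)
class Backend:
    """One row of pg_stat_activity (hypothetical helper type)."""
    pid: int
    usename: str
    state: str
    state_change: datetime


def stale_idle_backends(backends, now, max_idle=timedelta(minutes=5)):
    """Return backends that have been in the 'idle' state longer than max_idle."""
    return [
        b for b in backends
        if b.state == "idle" and now - b.state_change > max_idle
    ]


def connection_headroom(backends, max_connections):
    """Return how many more connections fit under the configured limit."""
    return max_connections - len(backends)
```

A monitoring job could raise an alert when <code>connection_headroom</code> drops below a chosen margin, or report the process IDs returned by <code>stale_idle_backends</code> so that an administrator can decide whether to terminate them.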


== Work Required ==
== Work Required ==
Line 41: Line 40:
* Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting.
* Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting.
* URY members need to change their behaviour so that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems.
* URY members need to change their behaviour so that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems.
* MyRadio should not be configured to run as a database superuser.


[[Category:Incident Reports]]
[[Category:Incident Reports]]