Incident Report: 20140202


Incident Report
A segmentation fault on our web server caused cascading failures on all URY Computing Services
Summary
  Severity:               Critical
  Impact:                 High (Complete failure of many URY Computing Services for 37+ minutes)
  Event Start:            02/02/2014 23:00
  Event End:              02/02/2014 23:49
  Recurrence Mitigation:  Multiple actions required to prevent recurrence.

Contacts
  Recovery Leader:        Anthony Williams <lpw@ury.org.uk>
  Other Attendees:        Andrew Durant <aj@ury.org.uk>


A failure of the PHP5 APC package, which at the time was occurring regularly, caused many of URY's Computing Services to be unavailable for a period of time on Sunday 2nd of February.

Due to a system issue under active investigation in PHP's APC module, which underpins the MyRadio caching system, the ability to service some PHP requests currently fails regularly with a segmentation fault. This issue breaks member-facing services rather than public-facing ones, and thanks to our monitoring systems service is usually restored within 5 minutes of failing. Investigation currently involves enabling increasing levels of debug compile options to locate the root cause of the segmentation fault. It was also not clear at the time whether or not this was related to another issue, where some Apache modules fail after a log rotate because they do not update their file pointers correctly.

From approximately 18:25 until 18:35 on 2nd February, the University of York's connection to Janet briefly failed for reasons that are currently unknown (src: "gavinatkinson doesn't actually know what happened with the offsite glitch"). Several monitoring IRC bots that often provide useful information, including xymon-bsod, our service monitoring bot, dropped off Freenode and failed to reconnect once access was restored.

Because xymon-bsod was offline, none of us were aware of the failure of myradio_daemon, one of our backend services, or of the increasing load average on our web server; the bot would otherwise have notified us when the initial segmentation fault appeared at 23:00 and load gradually increased.

At 23:12, Pyramid, the framework that our website is based on, started to report failures in some of its backend requests, which rely on the MyRadio API or certain parts of the database. At this point, xymon-bsod would likely have picked up HTTP response alerts too.

At 23:27, Pyramid started timing out completely while processing new requests. At this point, Apache's WSGI handlers also started queueing up waiting for responses from Pyramid, retrying several times before themselves timing out.

At 23:28, the first user report was received, via a private Facebook message to Lloyd Wallis. Lloyd was unavailable at the time and the message went unnoticed. Five more users reported the failure in this way over the next few minutes.

At 23:33, our PostgreSQL database reported that it had reached the maximum number of allowed active connections and stopped serving new requests. At this time, services such as our Jukebox Scheduler, Tracklisting and BAPS all stopped working, significantly hampering broadcast capabilities.
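
For reference, the state the database had reached here shows up as a count of open backends against the server's connection limit. The following is a minimal sketch only, assuming PostgreSQL 9.2 or later and the psycopg2 driver; the connection string and the "monitor" user are placeholders rather than URY's real configuration.

  import psycopg2

  # Compare the number of open backends with the server's max_connections limit,
  # and count how many of them are sitting idle. Placeholder DSN for illustration.
  conn = psycopg2.connect("dbname=postgres user=monitor")
  try:
      cur = conn.cursor()
      cur.execute("SELECT count(*) FROM pg_stat_activity")
      used = cur.fetchone()[0]
      cur.execute("SHOW max_connections")
      limit = int(cur.fetchone()[0])
      cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'idle'")
      idle = cur.fetchone()[0]
      print("connections: %d/%d in use, %d idle" % (used, limit, idle))
  finally:
      conn.close()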

At 23:34, the scale of the outage finally led to phone calls and other notifications drawing the attention of the actual URY Computing Team to the issue. At this point the load average on our web server was over 90. The database was diagnosed as the cause of many of the issues and a restart was requested at 23:38, but this was delayed while waiting for the idle connections to terminate. The restart of this service completed at 23:42 and the services required for broadcast recovered.

Apache was then stopped at 23:48 to allow the web server to calm down and recover to a regular load average. It was restarted at 23:49, at which point all URY Computing Services were once again operating normally.

Causes

  • The segmentation fault that causes MyRadio to fail is still under investigation to identify the root cause.
  • The stage of processing that MyRadio appears to be in at the time of the segfault means that a database connection has been opened but does not get cleanly closed due to the crash. This leaves idle connections on the database which, over time, cause other systems to fail.
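
To illustrate the second point: the pattern below releases its connection on every normal exit path, including exceptions. This is a minimal Python/psycopg2 sketch rather than MyRadio's actual PHP code, and the members query is invented for the example. A hard segmentation fault still defeats any in-process cleanup, so server-side measures (for example the tcp_keepalives_* settings in postgresql.conf, or shorter idle timeouts) remain necessary to reap connections whose client has vanished.

  import contextlib
  import psycopg2

  def handle_request(dsn, member_id):
      # contextlib.closing guarantees close() runs even if the handler raises;
      # it cannot help against a segfault, which kills the interpreter outright.
      with contextlib.closing(psycopg2.connect(dsn)) as conn:
          with conn:  # commit on success, roll back on error (psycopg2 >= 2.5)
              with conn.cursor() as cur:
                  # Hypothetical query standing in for real MyRadio processing.
                  cur.execute("SELECT name FROM members WHERE id = %s", (member_id,))
                  return cur.fetchone()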

Work Required

  • A review of the MyRadio code is required to ensure that any steps possible are taken to cleanly terminate database connections after a failure.
  • A review of PostgreSQL is required to see whether it can be configured with shorter idle connection timeouts or better handling of broken connections.
  • The root cause of the APC Segmentation Faults needs to be discovered and rectified, replacing APC with another solution if necessary.
  • Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots (a sketch follows this list) and email reporting.
  • A behaviour change of URY members is required to ensure that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems.
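
For the monitoring item above, a minimal sketch of an auto-reconnecting IRC client loop is shown below. It assumes Python 3 and a plain-socket bot in the style of xymon-bsod; the server, port and nick are placeholders, and a real bot would also need to re-join its channels and re-identify after reconnecting.

  import socket
  import time

  def run_bot(server="chat.freenode.net", port=6667, nick="xymon-bsod"):
      backoff = 5
      while True:
          try:
              with socket.create_connection((server, port), timeout=300) as sock:
                  sock.sendall(("NICK %s\r\nUSER %s 0 * :%s\r\n" % (nick, nick, nick)).encode())
                  backoff = 5  # reset once a connection has been established
                  while True:
                      data = sock.recv(4096)
                      if not data:
                          break  # server closed the connection; fall through and reconnect
                      if data.startswith(b"PING"):
                          sock.sendall(b"PONG" + data[4:])
          except OSError:
              pass  # covers timeouts, refused and reset connections
          time.sleep(backoff)
          backoff = min(backoff * 2, 300)  # exponential back-off, capped at 5 minutes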