Incident Report: 20140202: Difference between revisions
Created page with "{{Incident |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services |severity=Critical |impact=High (Complete failure of URY..." |
No edit summary |
Line 2: | Line 2: | ||
|brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services | |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services | ||
|severity=Critical | |severity=Critical | ||
|impact=High (Complete failure of URY Computing Services for 37+ minutes) | |impact=High (Complete failure of many URY Computing Services for 37+ minutes) | ||
|start=02/02/2014 23:00 | |start=02/02/2014 23:00 | ||
|end=02/02/2014 23:49 | |end=02/02/2014 23:49 | ||
Line 33: | Line 33: | ||
* The segmentation fault that causes MyRadio to fail is still under investigation to identify the root cause. | * The segmentation fault that causes MyRadio to fail is still under investigation to identify the root cause. | ||
* At the point where MyRadio appears to segfault, a database connection has been opened, and the crash prevents it from being closed cleanly. This left idle connections on the database, which over time caused other systems to fail. | * At the point where MyRadio appears to segfault, a database connection has been opened, and the crash prevents it from being closed cleanly. This left idle connections on the database, which over time caused other systems to fail. | ||
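The failure mode described above can be illustrated with a toy model. This is only a sketch: it assumes the database enforces a fixed connection limit, which the report does not state, and `ToyDatabase`, `simulate_crashing_client` and `reap_idle` are hypothetical names, not part of MyRadio or any real database driver.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatabase:
    """Hypothetical stand-in for a database server with a connection cap."""
    max_connections: int
    # Maps connection id -> time of last activity (simulated seconds).
    last_active: dict = field(default_factory=dict)
    next_id: int = 0

    def connect(self, now: float) -> int:
        # Refuse new clients once every slot is taken, leaked or not.
        if len(self.last_active) >= self.max_connections:
            raise ConnectionError("too many connections")
        conn_id = self.next_id
        self.next_id += 1
        self.last_active[conn_id] = now
        return conn_id

    def close(self, conn_id: int) -> None:
        self.last_active.pop(conn_id, None)

    def reap_idle(self, now: float, idle_timeout: float) -> int:
        """Close connections idle for longer than idle_timeout; return the count."""
        stale = [c for c, t in self.last_active.items() if now - t > idle_timeout]
        for c in stale:
            self.close(c)
        return len(stale)


def simulate_crashing_client(db: ToyDatabase, crashes: int, start: float = 0.0) -> None:
    """Each simulated crash opens a connection and never closes it."""
    for i in range(crashes):
        db.connect(now=start + i)
```

In this model, three crashes against `ToyDatabase(max_connections=3)` are enough to make the next `connect` call raise, and a later `reap_idle` call frees the slots again. Whether an idle timeout on the server side is the right mitigation here depends on the root cause, which is still under investigation.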
== Work Required == | == Work Required == | ||
Line 41: | Line 40: | ||
* Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting. | * Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting. | ||
* URY members need to change their behaviour so that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems. | * URY members need to change their behaviour so that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems. | ||
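As a rough illustration of the automated reconnection item above, a bot could retry its connection with exponential backoff and report once it gives up. This is a minimal sketch, not the URY bots' actual code: `reconnect_with_backoff`, `connect` and `on_give_up` are hypothetical names, and the hook is simply a place where email reporting could be attached.

```python
import time
from typing import Callable, Optional


def reconnect_with_backoff(
    connect: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    on_give_up: Optional[Callable[[Optional[Exception]], None]] = None,
):
    """Call connect() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError as exc:  # ConnectionError is a subclass of OSError
            last_error = exc
            # Wait 1x, 2x, 4x ... base_delay, capped at max_delay,
            # but do not sleep after the final failed attempt.
            if attempt < max_attempts - 1:
                sleep(min(max_delay, base_delay * 2 ** attempt))
    # Hypothetical reporting hook, e.g. a function that sends an email.
    if on_give_up is not None:
        on_give_up(last_error)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays. The cap on the delay stops a long outage from pushing the retry interval out indefinitely.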
[[Category:Incident Reports]] | [[Category:Incident Reports]] |