Incident Report: 20140202: Difference between revisions
|  Created page with "{{Incident   |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services   |severity=Critical   |impact=High (Complete failure of URY..." | No edit summary | ||
| Line 2: | Line 2: | ||
|    |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services |    |brief=A segmentation fault on our web server caused cascading failures on all URY Computing Services | ||
|    |severity=Critical |    |severity=Critical | ||
|    |impact=High (Complete failure of URY Computing Services for 37+ minutes) |    |impact=High (Complete failure of many URY Computing Services for 37+ minutes) | ||
|    |start=02/02/2014 23:00 |    |start=02/02/2014 23:00 | ||
|    |end=02/02/2014 23:49 |    |end=02/02/2014 23:49 | ||
| Line 33: | Line 33: | ||
| * The segmentation fault that causes MyRadio to fail is still under investigation to identify the root cause. | * The segmentation fault that causes MyRadio to fail is still under investigation to identify the root cause. | ||
| * The stage of processing that MyRadio appears to be in at the time of the segfault means that a database connection is opened but is not cleanly closed because of the crash. This left idle connections on the database, which over time caused other systems to fail. | * The stage of processing that MyRadio appears to be in at the time of the segfault means that a database connection is opened but is not cleanly closed because of the crash. This left idle connections on the database, which over time caused other systems to fail. | ||
| == Work Required == | == Work Required == | ||
| Line 41: | Line 40: | ||
| * Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting. | * Monitoring of system failures of this nature needs to be reviewed and improved, including automated reconnection of IRC bots and email reporting. | ||
| * A change in behaviour among URY members is required to ensure that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems. | * A change in behaviour among URY members is required to ensure that problems are reported through the correct channels. Lloyd Wallis is not a correct channel for reporting problems. | ||
| [[Category:Incident Reports]] | [[Category:Incident Reports]] | ||
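
The "Work Required" item on automated reconnection of IRC bots could be approached with a retry loop that backs off between attempts and reports when it gives up. The following is a minimal sketch only, not URY's actual bot code: the function name `reconnect_with_backoff`, its parameters, and the `on_give_up` reporting hook (which could, for instance, send an email) are all hypothetical and chosen for illustration.

```python
import time
from typing import Callable, Optional


def reconnect_with_backoff(
    connect: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    on_give_up: Optional[Callable[[Exception], None]] = None,
):
    """Call connect() until it succeeds, doubling the wait after each failure.

    All names here are hypothetical. connect is any callable that raises
    OSError (or a subclass such as ConnectionError) when the service is down.
    """
    delay = base_delay
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as err:
            last_error = err
            if attempt == max_attempts:
                break  # no point sleeping after the final attempt
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    # Every attempt failed: report through the hook, then raise.
    if on_give_up is not None and last_error is not None:
        on_give_up(last_error)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

A bot would pass its own connection routine as `connect`. Injecting `sleep` keeps the loop testable without real delays, and capping the delay stops the wait from growing without bound during a long outage.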