Incident Report: 20191104

{| class="wikitable"
|+ Incident Report: We DDoS'd ourselves
|-
! colspan="2" | Summary
|-
| Severity || Moderate
|-
| Impact || High (manual upload non-functional, one canceled show as a result)
|-
| Event Start || 04/11/2019 09:00
|-
| Event End || 05/11/2019 02:00
|-
| Recurrence Mitigation || Stop the self-DDoS, restart samba, cry
|-
! colspan="2" | Contacts
|-
| Recovery Leader || Marks Polakovs (MP)
|-
| Other Attendees || Matthew Stratford (MS), Isaac Lowe (IL), Anthony Williams (AW)
|}


This is still under construction because this literally just happened. When finished, I'll move this page into mainspace.

A combination of a regularly scheduled offsite backup job and some strange behaviour by the studio PCs caused urybackup0 (specifically the music store) to have severely degraded performance. This in turn meant that music uploads to the central library, which normally take less than a minute, would take far longer, sometimes reaching 10 minutes and timing out. One show had to be canceled as a result because the host was not able to get all his music onto the system in time.

== Timeline ==

(all times GMT)

09:00 - A regularly scheduled offsite backup job, which should be nightly and incremental... was neither of those things, for reasons unknown to man. It starts copying the entirety of the music store from backup0 to moyles (in the IT Services TFTA data-center on East), which starts hammering the ZFS pools on backup0.
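
For context, the distinction matters because a full ZFS send re-transfers the whole dataset, whereas an incremental send only transfers blocks changed since the last common snapshot. A minimal sketch of the two forms (the dataset, snapshot, and receive-side names here are illustrative, not the actual job on backup0):

<pre>
# Full send: streams the entire dataset every time it runs
zfs snapshot tank/music@2019-11-04
zfs send tank/music@2019-11-04 | ssh moyles zfs receive -F backup/music

# Incremental send: only streams blocks changed since the previous snapshot
zfs send -i tank/music@2019-11-03 tank/music@2019-11-04 | ssh moyles zfs receive backup/music
</pre>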

20:48 - A member reports in Slack that they are unable to upload music. With just over an hour to go to their show, and 12 songs left to upload, they are forced to cancel.

21:32 - MP alerts #computing in Slack to a serious incident. '''INCIDENT BEGINS.'''

21:35 - MS reports that he had done some package updates on dolby(?) over the weekend, including ffmpeg, which may be causing stuck uploads. (This is eventually shown to be a red herring, but is important to include in this report.)

''Between these entries, the computing team are sidetracked by an issue with MyRadio emails''

22:30 - MS reports that ffmpeg conversions work when run manually.

22:36 - MP adds debug logging to the MyRadio upload controller, showing that it is hanging on a call to <code>move_uploaded_file</code>, a built-in PHP function. This leads him to point the finger at the music store.
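
Since <code>move_uploaded_file</code> ultimately just writes the file to its destination, one quick way to confirm that the hang is in the storage layer rather than in PHP is to time a small write to the same mount from the web server (the mount path below is illustrative):

<pre>
# Time a small synced write to the music store mount (path is illustrative)
time dd if=/dev/zero of=/mnt/music/.iotest bs=1M count=16 conv=fsync
rm /mnt/music/.iotest
</pre>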

22:55 - AW logs on to backup0 to troubleshoot the pools, sees the degraded performance, but notes that <code>zpool</code> reports the pools as healthy.
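
This is a common trap: <code>zpool status</code> only reports device and redundancy health, so a pool under crushing I/O load can still show every vdev as ONLINE. Watching live I/O alongside it gives the fuller picture:

<pre>
# Device and redundancy health only - says nothing about load
zpool status -v

# Live read/write ops and bandwidth per pool (per vdev with -v), sampled every 5 seconds
zpool iostat -v 5
</pre>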

22:59 - AW notices the stuck(?) offsite backup job.

23:04 - AW notes that the RAID controllers are chugging as well.

23:10 - MP and AW decide to restart Samba, but have to wait until midnight as there is a show on air.

23:13 - AW notices that a Samba log is being written at around 100 KB/s, and a 5 MB log size limit means Samba is rotating the log every few seconds. He also notes that the ARC (ZFS's in-memory read cache) is very low and that swap is not enabled. Around this time, a very frustrated MP goes for a run to let off steam.
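
Both of those observations are quick to verify on the box itself. A sketch, assuming backup0 runs ZFS on Linux (on FreeBSD the ARC figures live under <code>sysctl kstat.zfs</code> instead):

<pre>
# Current and maximum ARC size, in bytes (ZFS on Linux)
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# Confirm whether any swap is configured
swapon --show
free -h
</pre>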

23:25 - AW notes that the incremental backup job... is not incremental... and has been sending the entire music dataset to moyles for over 12 hours.

23:35 - IL logs on and reports that the backup was ''intentionally'' not an incremental one (for reasons that the author (MP) does not yet understand). At 23:38 he kills it.
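
For reference, a transfer like this normally runs as a <code>zfs send | ssh | zfs receive</code> pipeline, so stopping it cleanly means finding and killing the sending side (a sketch; the actual job wrapper on backup0 may look different):

<pre>
# Find the sending side of the pipeline and note its PID
pgrep -af 'zfs send'

# Kill it; the receive on moyles aborts once the stream closes
kill <PID>
</pre>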

23:49 - MS and IL report that the Windows 10 Photos app is seemingly hammering backup0's shares.
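
On the server side, <code>smbstatus</code> is the quickest way to see which clients are connected and which files they hold open, which is how this kind of client misbehaviour usually gets pinned to a machine:

<pre>
# Connected clients, shares in use, and open/locked files
smbstatus

# Just the locked files, if the full output is too noisy
smbstatus --locks
</pre>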

00:05 - MP returns and stops samba. Performance on backup0 immediately jumps tenfold. However, MP screws up: he doesn't realise that Samba also handles authentication of his Windows-domain credentials, so with it stopped he can no longer log in.

00:10 - MS logs on with his local account and restarts samba. Performance degrades immediately. MP notes that Xymon does not monitor samba on backup0 at all.
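
As a follow-up, that monitoring gap is cheap to close: the Xymon client already reports running processes, so a couple of process rules for backup0 in the server's <code>analysis.cfg</code> would alert when the Samba daemons stop (a sketch, assuming a stock Xymon client on backup0; the exact file location depends on the install):

<pre>
HOST=backup0
        # go red if no smbd / nmbd processes are running
        PROC smbd 1
        PROC nmbd 1
</pre>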

00:14 - AW notes that the Photos app hammering reported earlier is likely coming from one of the studio PCs.

00:38-00:43 - MP remotes into wogan and restarts all the studio presenter and guest PCs. At 00:56 he reports that performance has not, in fact, improved substantially.

01:06 - MS tests song uploading and notes that, while it is still slow, it has improved dramatically. '''INCIDENT ENDS'''

01:26 - MP and the gang head to bed.