Changes

no edit summary
Line 14: Line 14:  
'''This is still under construction because this literally just happened. When finished, I'll move this page into mainspace.'''
 
   −
== Summary ==
+
__TOC__
 +
 
 +
= Summary =
    
A combination of a regularly scheduled offsite backup job and some strange behaviour by the studio PCs caused severely degraded performance on urybackup0 (specifically the music store). As a result, music uploads to the central library, which normally take less than a minute, took far longer, sometimes reaching 10 minutes and timing out. One show had to be cancelled because the host was not able to get all his music onto the system in time. The underlying problem was not resolved, but the incident was mitigated by restarting Samba and the studio PCs.
   −
== Timeline ==
+
=== What Went Well ===
 +
 
 +
* Effective collaboration by the incident team, especially considering MP didn't really know what he was doing around backup0
 +
 
 +
=== What Did Not Go Well ===
 +
 
 +
* It took a long time to get the incident declared, by which point one show had already been forced to cancel
 +
* We had little insight into what was going on with Samba/ZFS; our only tools were htop and tcpdump captures
 +
* It took some time to find the people who knew what they were doing
 +
* Red herrings abounded (FFmpeg, RAID issues, hammering, the backup job - any or all of these could have been a red herring)
 +
 
 +
=== How We Got Lucky ===
 +
 
 +
* After Samba and the studio PCs were restarted, performance recovered to acceptable levels
 +
 
 +
= Next Steps =
 +
 
 +
* Continue investigating - no real conclusion as to the causes yet
 +
* We Do Not Patch Shit Without Telling Anyone, and anyone who does that will be slapped around a bit with a trout
 +
 
 +
= Timeline =
    
(all times GMT)
 