= Summary =
 
A combination of a scheduled offsite backup job being run irregularly (manually, and in full rather than incrementally) and some strange behaviour by the studio PCs left urybackup0 (specifically the music store) with severely degraded performance. This in turn meant that music uploads to the central library, which normally take less than a minute, took far longer, sometimes reaching ten minutes and timing out. One show had to be cancelled as a result because the host was not able to get all of his music onto the system in time.

== Resolution ==

The guilty Windows PCs were restarted, and the Windows 10 Photos app was (or will be) uninstalled to prevent this from recurring.

== Analysis ==

This performance degradation is not new: the Xymon graphs (specifically, the CPU rolling average and I/O on urybackup0) suggest that the issue has been going on for many months. Anecdotally, presenters have reported that music upload performance has been slow. It could also be related to the jukebox-jokebox issues, as jukebox may have trouble fetching song files over NFS.

What made it particularly noticeable this time was the ZFS backup job, which was the straw that broke the camel's back. AW speculates that this could have made the RAID controllers I/O-bound and unable to service requests in time.
    
=== What Went Well ===
 
=== What Didn't Go Well ===
* It took a long time to get the incident declared, by which point it had forced one show to cancel
* We had limited insight into what was going on with samba/zfs (little beyond htop and tcpdumps), especially since...
* ... the current comp team aren't experienced with samba/zfs/backup0, and it took some time to find the people who knew what they were doing
* Red herrings abound (FFMPEG, RAID issues, hammering, the backup job - any or all of these could have been a red herring ''or'' a causative factor)
* An unrelated issue with the MyRadio email system distracted the computing team

=== How We Got Lucky ===
 
* We found the people who knew what they were doing, and with their help brought the incident under control
    
= Next Steps =
 
== Short Term ==

1. Remind presenters that computing are not infallible, and that they should have all their music uploaded well in advance
:* MP - done
:* If the incident recurs, it may be worth reminding them about aux playout and manual tracklisting

2. Create a change management / logging process, to ensure everyone is on the same page about system upgrades and changes - we had no knowledge of the changes to the backup script until IL informed us on IRC

3. Stop Windows 10 from being so stupid

4. Review regularly scheduled jobs and ensure as few as possible run during the daytime

== Long Term ==

1. Upgrade Samba on urybackup0 to 4.10 - 4.6 is end-of-life
:* May need to wait until after term-time, as this is a risky process

2. Potentially move the Music Store and pool1 to separate RAID controllers in urybackup0

3. Or just take urybackup0 behind the shed and put a bullet in its head - per AW, the chassis and RAID controllers are old

4. Review documentation and have URY computing alumni distill as much knowledge as possible
    
= Timeline =
 
(all times GMT)

09:00 - IL, after deploying a new backup script on urybackup0, notices a bug with it, fixes it, and runs it manually. It does a full backup rather than an incremental one, partly to remediate said bug, and starts copying the entirety of the music store from backup0 to moyles (in the IT Services TFTA data centre on East), which hammers the ZFS pools on backup0.
    
20:48 - A member reports in Slack that they are unable to upload music. With just over an hour to go to their show, and 12 songs left to upload, they are forced to cancel.
 
22:36 - MP adds debug logging to the MyRadio upload controller, showing that it is hanging on a call to <code>move_uploaded_file</code>, a built-in PHP function. This leads him to point the finger at the music store.
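
For illustration, instrumentation along the following lines is enough to show a request stuck inside <code>move_uploaded_file</code>. This is a minimal sketch, not the actual MyRadio controller code; the form field name and destination path are invented for the example.

<syntaxhighlight lang="php">
<?php
// Illustrative sketch only - not the real MyRadio upload controller.
// 'audio' and the destination path are invented; the real destination lives
// on the music store on urybackup0, which is where the call was stalling.
$tmpPath  = $_FILES['audio']['tmp_name'];
$destPath = '/music/uploads/' . basename($_FILES['audio']['name']);

$start = microtime(true);

// move_uploaded_file() blocks until the file has been written out, so with
// the store this slow it hangs for minutes instead of well under a second.
$ok = move_uploaded_file($tmpPath, $destPath);

error_log(sprintf(
    'upload debug: move_uploaded_file(%s -> %s) %s after %.1fs',
    $tmpPath,
    $destPath,
    $ok ? 'succeeded' : 'failed',
    microtime(true) - $start
));
</syntaxhighlight>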
 
22:55 - AW logs on to backup0 to troubleshoot the pools and sees the degraded performance, but notes that zpool reports the pools are running fine. AW notices the sluggish offsite backup job and that the RAID controllers are chugging as well. MP and AW decide to restart Samba, but have to wait until midnight as there is a show on air.
    
23:13-23:25 - AW notices that a Samba log is generating 100KB/s, and a 5MB log size limit means that it is rotating the log every few seconds. He also notes that the ARC cache is very low and that swap is not enabled. AW notes that the incremental backup job... is not incremental... and has been sending the entire music dataset to moyles for over 12 hours. Around this time, a very frustrated MP goes for a run to let off steam.
 
23:35 - IL logs on and reports that the backup was ''intentionally'' not an incremental one. However, he notes that it isn't pulling enough data to be the sole cause of the slowdown.
    
''23:41 - MP gets lost in Alcuin.''
 
01:26 - MP and the gang head to bed.
 
Around 05:00 - the backup script finishes and network and I/O performance returns to normal.
