Incident Report: 20191104

From URY Wiki
{{Incident
   |brief=We DDoS'd ourselves
   |severity=Moderate
   |impact=Medium (Manual upload non-functional, one canceled show as a result - only a few seconds of dead air though)
   |start=04/11/2019 09:00
   |end=05/11/2019 02:00
   |mitigation=Stop the self-DDoS, restart Samba, cry
   |leader=Marks Polakovs (MP)
   |others=Matthew Stratford (MS), Isaac Lowe (IL), Anthony Williams (AW)
}}


=URY is FUCKED (Frustratingly, URY Can't Keep Electronics Deferential)=

'''This is still under construction because this literally just happened. When finished, I'll move this page into mainspace.'''

== Summary ==

A combination of a regularly scheduled offsite backup job and some strange behaviour by the studio PCs caused urybackup0 (specifically the music store) to have severely degraded performance. This in turn meant that music uploads to the central library, which normally take less than a minute, took far longer, sometimes reaching 10 minutes and timing out. One show had to be canceled as a result, because the host was not able to get all his music onto the system in time. The incident was not fully resolved, but it was mitigated by restarting Samba and the studio PCs.

== Timeline ==

(all times GMT)

9:00 - A regularly scheduled offsite backup job, which should be nightly and incremental... was neither of those things, for reasons unknown to man. It starts copying the entirety of the music store from backup0 to moyles (in the IT Services TFTA data center on East), which starts hammering the ZFS pools on backup0.
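For contrast, an incremental job in the intended sense only sends files changed since the previous run. A minimal Python sketch of that selection step (hypothetical, not URY's actual backup tooling):

```python
import os

def files_changed_since(root, last_run_ts):
    """Return paths under root modified after the previous backup run.

    A full (non-incremental) run skips this check and sends everything -
    which is what hammered the pools on backup0.
    """
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_run_ts:
                changed.append(path)
    return changed
```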

20:48 - A member reports in Slack that they are unable to upload music. With just over an hour to go to their show, and 12 songs left to upload, they are forced to cancel.

21:32 - MP alerts #computing in Slack to a serious incident. '''INCIDENT BEGINS.'''

21:35 - MS reports that he had done some package updates on dolby(?) over the weekend, including ffmpeg, which may be causing stuck uploads. (This is eventually shown to be a red herring, but is important to include in this report.)

''Between these entries, the computing team are sidetracked by an unrelated issue with MyRadio emails''

22:30 - MS reports that ffmpeg conversions are working manually.

22:36 - MP adds debug logging to the MyRadio upload controller, showing that it is hanging on a call to <code>move_uploaded_file</code>, a built-in PHP function. This leads him to point the finger at the music store.
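The debug-logging trick generalises: wrap the suspect step and record anything that runs suspiciously long. A sketch of the idea in Python (MyRadio's controller is actually PHP; the names and thresholds here are hypothetical):

```python
import functools
import time

slow_calls = []  # (function name, seconds) for calls over the threshold

def log_slow(threshold_s):
    """Decorator: record the wrapped call if it exceeds threshold_s."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                if elapsed > threshold_s:
                    slow_calls.append((fn.__name__, elapsed))
        return wrapper
    return deco

@log_slow(threshold_s=0.05)
def move_upload(src, dst):
    # stand-in for the move_uploaded_file call that was hanging;
    # here it just sleeps to simulate the slow music store
    time.sleep(0.1)

move_upload("/tmp/upload.mp3", "/music/store/upload.mp3")
```

In the real incident the equivalent log line pointed straight at the file move onto the degraded store, narrowing the search considerably.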
 
22:55 - AW logs on to backup0 to troubleshoot the pools, sees the degraded performance, but notes that <code>zpool status</code> reports the pools as healthy.


22:59 - AW notices the stuck(?) offsite backup job

23:04 - AW notes that the RAID controllers are chugging as well

23:10 - MP and AW decide to restart Samba, but have to wait until midnight as there is a show on air

23:13 - AW notices that a Samba log is generating 100KB/s, and a 5MB log size limit means that it is rotating the log every few seconds. He also notes that the ARC cache is very low and that swap is not enabled. Around this time, a very frustrated MP goes for a run to let off steam.

23:25 - AW notes that the incremental backup job... is not incremental... and has been sending the entire music dataset to moyles for over 12 hours

23:35 - IL logs on and reports that the backup was ''intentionally'' not an incremental one (for reasons that the author (MP) does not yet understand). At 23:38 he kills it.

''23:41 - MP gets lost in Alcuin.''

23:49 - MS and IL report that the Windows 10 photos app is seemingly hammering backup0's shares

00:05 - MP returns and stops samba. Performance on backup0 immediately jumps tenfold. However, MP screws up and doesn't realise that samba also handles his credentials (from the Windows domain).

00:10 - MS logs on with his local account and restarts samba. Performance degrades immediately. MP notes that Xymon does not monitor samba on backup0 at all.

00:14 - AW notes that the picture hammering noted earlier is likely coming from one of the studio PCs.

00:38-00:43 - MP remotes into wogan and restarts all the studio presenter and guest PCs. At 00:56 he reports that performance has not, in fact, improved substantially.

01:06 - MS tests song uploading and notes that, while it is still slow, it has improved dramatically. '''INCIDENT ENDS'''

01:26 - MP and the gang head to bed.
[[Category:Incident Reports]]

Revision as of 14:10, 5 November 2019

Incident Report: 20191029
Jukebox forgets how audio happens, and suddenly remembers after scaring the Chief Engineer
Summary
Severity Moderate
Impact High (Approx. 1 hr of dead air)
Event Start 29/10/2019 11:02
Event End 29/10/2019 12:26
Recurrence Mitigation Make sure the dead air alarm actually alerts when dead air happens
Contacts
Recovery Leader Marks Polakovs (MP)
Other Attendees Morgan McKay (MM)


Report under construction.

Summary

An as-yet-unpinpointed issue caused Jukebox to fail to output to TX and Online for over an hour, despite the jukebox service still playing out music on Icecast and through the studio faders. Also, the dead air alarm didn't tell anyone, the bastard.

Rough Sequence of Events

At around 11:02 Jukebox stops playing out audio to streaming and TX. Nobody notices, as nobody listens to us, and the dead air detector sends no emails (not that it could do much other than say "We detected a silence, jukebox has been switched to jukebox!").

Having finished his morning lectures at around 12:00, MM enters the station and has a quick check of Jukebox via a desk's AM feed. MM notices that no sound is coming out. Checking the feeds from the AM2, Yamaha, and Icecast shows the same. In a nervous panic MM alerts the committee, switching to Red and playing out the Jukebox feed via the Pres PC (through Icecast, as Jukebox was still playing audio itself) at 12:11. At the time MM didn't know that online was off as well as AM, so he ran into TX and stared at it. The transmitter was sitting there all happy and smug, no reflected power to speak of, just the ominous "Audio break" LED latched on. Having no idea what to check, MM comes back to the studio after a good 10-minute stare and contemplate to find that Jukebox is back on air playing as normal.

Turns out that during this thoughtful TX stare someone came in, thought "Why the fuck is Red on? That's not right!", and switched to Jukebox, which suddenly worked as normal. MM is very confused at this and declares the incident over at 12:26, unsure whether it actually was (but the dead air was gone and MM had shit to do, so eh, good enough).

What Went Well

After the incident was declared, partial recovery only took five minutes and a complete recovery took 20.

What Did Not Go Well

The dead air went unnoticed for over an hour, due to a failure in the dead air alarm.

How We Got Lucky

  • The Chief Engineer was in the station when he thought to check AM and noticed the dead air - had it been a presenter, who may not have checked, the dead air could have lasted far longer
  • Whatever broke in Jukebox/Liquidsoap unbroke itself

MP's Investigation

After running out of his seminar on East, and getting stopped dead in his tracks by First Bus, MP started pulling all the logs he could - namely the HQ audio logs and the Liquidsoap logs. From this he was able to establish the following timeline (all times GMT):

Timeline

Dead air begins: 11:02:30

MM reports dead air in Slack: 12:06

Dead air ends with MM's presPC bodge: 12:11:20

Jukebox back on air via selector: 12:26:12

Total dead air: 1:08:50
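The total can be sanity-checked directly from the first and last stamps:

```python
from datetime import datetime

FMT = "%H:%M:%S"
begins = datetime.strptime("11:02:30", FMT)  # dead air begins
ends = datetime.strptime("12:11:20", FMT)    # MM's presPC bodge restores audio
print(ends - begins)  # 1:08:50
```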

Significant realisations

  • Jukebox continued to tracklist even throughout the dead air
  • The selector logs showed nothing out of the ordinary - at 00:02:12 output was switched to Jukebox, where it stayed until 12:11:20 when MM switched to Red to end the dead air
  • The dead air was heard on both AM and online - it appears clearly in the HQ log
  • The dead air alarm... did not alarm about the dead air
  • When MM switched output to Red, it worked without any problems

Conclusions

The most likely culprit is Liquidsoap hitting one of its numerous bugs, getting stuck, and failing to output audio until the switch to Red unstuck it. The dead air alarm is part of sel.liq, so its failure to go off implies that the whole script broke - which would make sense, considering that Jukebox continued to tracklist (implying it thought it was on air).

Other potential culprits are the Jack connection from Liquidsoap to the sound card, or the sound card itself, or TX itself, but all of these are unlikely:

  • Jack is relatively stable software
  • While Focusrite does not officially have Linux drivers, their cards are actually often more stable on Linux than on Windows
  • TX is hard to break, plus, the dead air was on both AM and online, which makes a physical signal path break unlikely

Further Work

  • Create a more reliable dead air detector
    • In progress - MP is working on DEARIE ME (Disappointingly Empty Airwave Remediation and Incident Escalation Management Engine), a dead air alarm listening to the AM feed
  • Upgrade liquidsoap in the hopes of fixing bugs
    • Will need to wait, as this is a risky process
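The core of a detector of the DEARIE ME sort can be sketched as an RMS level check with a duration requirement over the monitored feed. A minimal Python sketch (all thresholds and structure hypothetical - the real design may differ):

```python
import math

SILENCE_RMS = 0.01     # hypothetical level threshold for "silence"
SILENCE_SECONDS = 30   # how long below the threshold counts as dead air

def rms(samples):
    """Root-mean-square level of a buffer of PCM samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_dead_air(chunks, chunk_seconds=1.0):
    """Return the time (in seconds) at which dead air is detected, or None.

    chunks: iterable of per-second sample buffers from the monitored feed
    (e.g. an AM capture). A real implementation would alert - email, Slack -
    instead of returning.
    """
    quiet = 0.0
    for i, chunk in enumerate(chunks):
        if rms(chunk) < SILENCE_RMS:
            quiet += chunk_seconds
            if quiet >= SILENCE_SECONDS:
                return (i + 1) * chunk_seconds
        else:
            quiet = 0.0
    return None
```

Crucially, such a detector listens to the actual output feed, so it keeps working even if the playout script itself (sel.liq) is the thing that broke.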