Incident Report: 20200304 and Website History: Difference between pages

{{Incident
  |brief=Dante is hard, mmkay?
  |severity=High
  |impact=High (Approx 15mins of dead air, one ruined live show, also server reboot)
  |start=04/03/2020 21:17
  |end=04/03/2020 21:58
  |mitigation=See body
  |leader=Marks Polakovs (MP)
  |others=Matthew Stratford (MS), Isaac Lowe (IL), Michael Grace (MG), Jacob Dicker (JD), Jess Schofield (JS)
}}


''' Note: report is still a work-in-progress, expect changes. Contributions are welcome, please inform MP. '''


= Summary =


While comp team were investigating CPU load issues on Dolby, the Stores RedNet 3 dropped off the network for seemingly unrelated reasons. This has been seen before. What hadn't been seen before is what happened next: after humming along perfectly happily for 9 minutes, the Dante network started re-electing its clock master every 15 seconds, which led to drop-outs. While trying to fix that, we switched the clock input from BNC to ADAT, which really upset the Scarlett 18i20, eventually requiring a full reboot of Dolby to get back on air.


== What Went Well ==


* Given the scale of the hardware issues experienced, the fact that we had a total of 15 minutes of dead air is not ideal, but it could have been a lot worse.
:* However, a lot of that was off-air-loop, sine-wave, or Flagship News Sosij, which, while ''technically'' not dead air, isn't exactly quality broadcast output. TODO exact numbers, but somewhere around 10m of that.


== What Did Not Go So Well ==


* Took us a while to realise that the RedNet had failed (especially since the failure was at layer 7, so it didn't trigger our monitoring, which only checks layer 4)
* Switching clock input to ADAT was a mistake and delayed recovery quite substantially
* Little insight into what happened with the Scarlett: we saw some Jack errors, diagnosed a probable hardware/driver error, and made the call to reboot the system in order to get back on air
:* journald is not set to persistent, so the only post-hoc logs we have are from after the reboot
* TX OB was delayed


== How We Got Lucky ==


* Almost all current experienced computing team members were in the station and ready to respond quickly
* Clocking resolved itself, sooner or later, as we have no idea what actually caused it


= Analysis =


The most likely culprit in this scenario is some kind of hardware issue, potentially an electrical fault in the RedNet 3. This would explain why it was cycling between link DOWN and UP on the network - it's possible that shifting logic levels inside confused the ethernet controller and/or sent invalid Layer 1 signals. It's possible that during the clocking storm (21:29-21:32) the RedNet was disappearing from the network, triggering a Dante clock master election, then re-appearing and establishing itself as master, before leaving again. (This does not explain, however, why the master would cycle between Phil, Office, and StudioRed, as one would hope Dante can at least pick one as master and stick with it.) This would also explain the hot cable - normally a word clock signal is a mere 5V, nowhere near enough to heat the cable up. The noise heard during the clocking storm also sounded digital, almost like dropped packets. Anecdotally, we've had issues with RedNet power supplies in the past.


= Actions =


* MP to remove song that caused Campus Playout bug and high Dolby CPU usage from rotation - '''Done'''
* MP to set up emergency back-up audio player on BSOD - '''Done'''
* MS to finish hardware dead-air detector - '''Done'''
* JS and BA to contact Focusrite Support about RedNet issues - '''Done'''
* BA to replace clock coax with high-quality SDI cable
* MP to set up journald persistence on Dolby (and ideally all Debian boxen) - '''Done''' on Dolby, on hold for the rest
* Purchase alternative Dante<->ADAT interface - '''Done'''
* MP and MS to sort out SelectorListener startup - on hold until off-air
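
For reference, the journald persistence action above is normally a one-line configuration change. This is a minimal sketch assuming a stock systemd-journald setup; the exact configuration applied on Dolby is not recorded in this report.

```ini
# /etc/systemd/journald.conf
# Assumption: default Debian layout; only the Storage setting is changed.
[Journal]
# Keep logs on disk under /var/log/journal so they survive a reboot.
Storage=persistent
```

After restarting the journal daemon (<code>systemctl restart systemd-journald</code>), logs from the previous boot should be readable with <code>journalctl -b -1</code>.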


= Timeline =


[https://docs.google.com/spreadsheets/d/1P61vGuJai7zTar07iOwK8H1sNvb5J2WGZ78o6hJRBiA/edit#gid=0 All logs]


19:00-21:00 MS and MP investigating CPU issues on Dolby.


21:00-21:10 sporadic bits of dead air on AM & online (some presenter error, some potentially Dante related)


21:17:00 '''MG reports “big issue” in Slack'''


21:18:35 Dante Controller reports “Device 001DC10208C1 (device name not known) is now a grandmaster” - the RedNet 3 has dropped off


: In retrospect, I (MP) don't think this was the entry that confirmed R3 dropoff, as the timestamp was just when Dante Controller was started. However, it's hard to pin down the exact time, as the switch logs show the R3 flapping on and off the network repeatedly.


21:18:36 MP screams “DANTE!”


21:19-21:20 '''MP turns the RedNet 3 off and on again'''


21:20 Dante starts flapping clock, but seemingly stabilises


21:28 The AM feed becomes noticeably sosig, slowly degrades quality


21:29-21:32 The AM goes to '''DEAD AIR''' with occasional bursts of techno. At the same time, Dante Controller reports '''various clock switches and devices muting''' (which is Dante-speak for “oman i am no good with audio pls to halp”). The clock master cycles between Phil, Office, and Studio Red several times. MP says in Slack at 21:31 “[Dante] is not in a good place right now.”


: Right as the clocking storm starts, the switch starts reporting <code>IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is DOWN.</code> - port 11 is uryStores. Later <code>IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is UP.</code> and then DOWN again, ad infinitum.


: Relevant log entry: <code>2020-03-04 21:30:53 GMT 1583357453638 Information "uryStores" "Timed out 3 times sending message 'UpdateRxChannels' to uryStores, giving up."</code>
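
Pinning down when the R3 was actually off the network means counting link transitions in the switch logs. A rough sketch of how that could be done is below. It assumes the log lines look exactly like the two <code>LINK_UPDOWN</code> messages quoted above; the function name <code>count_flaps</code> and the sample data are illustrative only, not part of any URY tooling.

```python
import re
from collections import Counter

# Matches the two switch messages quoted in the timeline. Any other
# log format is outside the scope of this sketch.
LINK_RE = re.compile(
    r"LINK_UPDOWN\s+(?P<port>\S+)\s+link status is (?P<state>UP|DOWN)"
)


def count_flaps(lines):
    """Count link state changes (UP->DOWN or DOWN->UP) per switch port."""
    last_state = {}
    flaps = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if not match:
            continue  # not a link status message
        port, state = match["port"], match["state"]
        # Only count a flap when the state differs from the last one seen.
        if port in last_state and last_state[port] != state:
            flaps[port] += 1
        last_state[port] = state
    return flaps


# Hypothetical sample built from the messages quoted in this report.
sample = [
    "IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is DOWN.",
    "IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is UP.",
    "IFNET Error LINK_UPDOWN GigabitEthernet1/0/11 link status is DOWN.",
]
print(count_flaps(sample))
```

A port with a high flap count over a short window (here port 11, uryStores) is a good candidate for a layer 1 or hardware problem rather than a Dante configuration issue.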


21:33:34 MP, MS, BA set Studio Blue to clock master. Dante sort-of stabilises except not really


21:36:21 '''IL reports that the clock coax between uryStores and Scarlett is hot to the touch.'''


: "Boiling" was the word he used during the debrief - "not enough to burn me, but certainly enough to be concerning".


21:38:25 '''we go to DEAD AIR''', and at 21:39:14 Dearie-Me triggers.


Around 21:37-39 MP uses RedNet Control to switch the RedNet 3’s clock input to ADAT, believing this will help things. This does not make the Scarlett happy.


: JD speculates that this was because, when we switched clocking to ADAT, the RedNet was trying to feed the Scarlett clock via ADAT at the same time that the Scarlett was trying to feed the RedNet clock via BNC (the audio equivalent of two DHCP servers on one network).


Around this time IL attempts to switch out the coax, but fails when he realises that the replacement cable is a piece of shit (frayed ends), and, after noticing we'd gone to dead air, switches it back.


: During the debrief IL reported that, looking at the front panel of the Scarlett, no lights were on - not even the AM return feed, which should have at least some signal (modulation noise) even in the event of dead air


During this time IL and BA are setting up a TX OB (in layman's terms, shove a microphone directly into the transmitter to get ''some'' signal on air). Much running between office and Stores to gather equipment ensues.


21:38:49 Dante Controller reports that uryStores is clock master, as the team is oblivious to the ongoing Scarlett cataclysm.


21:39:56 Studio Blue is switched to Clock Master, confirmed by MP in Slack at 21:40:07. Still dead air.
 
21:42 MP tries an audio pipeline restart (aka startAudio.sh). This does not help.
 
: Gracefully stopping Jack via systemd fails and MP has to kill it, probably due to driver issues.
 
:: jack_lsp reported <code>jack_client_open() failed, status = 0x21</code>
 
: During the debrief BA speculated that, although it was showing up as a device (MP has no logs of this), the Scarlett had probably borked itself completely due to the double-clocking.
 
21:44 MP makes the call to reboot Dolby.
 
21:46:42 Dolby boots to Linux.
 
21:47:04 MP SSHes into Dolby and tries starting audio, failing.
 
21:49:41 '''DEAD AIR ENDS''' - goes to horrible sosig Flagship News (from the TX OB)
 
21:50:17 MP runs startAudio.sh
 
21:50:50 uryStores becomes clock master, at 21:51:09 Dante finishes re-synchronising
 
21:52:00 horrible sosig Flagship News ends and AM has off-air loop
 
21:54-21:56 MP realises that SelectorListener isn’t running, and runs some commands to try and start it. Selector cycles between Off-Air, Sine Wave, and Jukebox as MP tests that it works.
 
21:58:01 MP selects Jukebox.
 
21:58:59 '''MP declares the incident over'''
 
[[Category:Incident Reports]]

Revision as of 11:18, 19 July 2020

Here's a potted history of the URY website, courtesy of the Wayback Machine.

c.1999-Oct 2003

The earliest version of the website available on the Web Archive was definitely a product of its time, with the bright orange branding of that era prominent throughout and a very 90s GIF-based sidebar on the left.

It even had a guestbook, with some rather interesting contents.

At one point in 2000, Gavin Atkinson updated the site.

This design was created by Leo Warner, and doesn't really work too well in 1080p.

Webcasting

At the start of Web Archive captures of the URY website, URY were still broadcasting only on 999kHz and did not yet simulcast on the Internet; however, by 2003, URY had leapt forward into the Internet Age by hosting a worldwide live stream... using RealPlayer. Oh well...

Oct 2003-Summer? 2006

A radically new website design was launched in time for Autumn term 2003, featuring for the first time what seemed to be sensible web design (for it was a new millennium and the days of gaudy sidebars and orange on grey were far behind the URY computing team, in all their wisdom).

The guestbook and RealPlayer streams were still there, though.

This website does work quite well in 1080p, considering.

The then Head of Production, Simon Taghioff, was instrumental in this overhaul.

2006-2010

A minor update of the previous website, with even more orange... and no guestbook in sight! RealPlayer by now had been joined by MP3 and Ogg Vorbis streams as URY's streaming technology marched on.

What the hell is that font on the advertising banner?

This design was jiggled around a bit over its four years of service, but remained mostly the same.

2010-2011

In what was probably the most short-lived (and expensive!) of website designs, URY got a professional graphics designer (http://www.freelancegraphicdesigner.co.uk/ury-web-design.html) in to completely redesign the website in conjunction with URY's comprehensive rebranding.

The result was a lovely set of graphics (lovely being subjective on whether or not you like Impact as a font), but the code for the website wasn't as lovely. According to legend, the site was programmed in under a week to meet harsh deadlines and was therefore effectively hacked together. Despite all this, it worked for a year and as of writing the code is still there in heavily modified form.

Sources indicate that a DaveX was responsible for the coding.

2011-2012

The current website was largely the result of a rehashing of the design from last year by the combined efforts of Darren Webb and Rob Stonehouse on design and Matt Windsor on programming (which mainly involved tidying up the previous round of code and implementing the design changes in HTML5 and CSS).

This website won a YUM award in 2011.

There's still no guestbook.

2012-2013

In October 2012, the website was completely replaced with a shinier, newer, completely re-written site based on Django (a Python web framework). Despite the shiny new design, we immediately regretted this decision. The site was put live before it was ready - features were missing and were never fully implemented in this generation, and large amounts of it relied on a completely new database schema, so all of the Members' Internal website tools broke with the replacement. It suffered in service for less than a year before it was retired in August 2013.

2013-2018

Sticking with the Python, Matt Windsor again went on an endeavour for a better website. With an entirely new codebase in Pyramid (another Python web framework) and SQLAlchemy, and a few shinifications to the actual design itself, this site went into production in August 2013, at the same time as our upgrade to Apache 2.4 and the replacement of Members' Internal with MyURY. Over the remainder of the Summer Holidays, MyURY was expanded to ensure it had capabilities to actually maintain this website, and so shiny Banner and Podcast systems were available and the site once again looked pretty.

There's still no guest book, but there is a sign up form on the Get Involved page.

2018-Present

For a picture, load up ury.org.uk!

The current version of the site, amusingly enough codenamed 2016-site, for it was started in 2016 but only released in 2018 (arguably still not finished...), was designed by Brooke Hatton and coded up by himself alongside (at various times) Matthew Stratford, Chris Taylor, Matt Windsor, Natalie Harris, Danny Roberts, and many others. Out went Python, and in came the modern programming language du jour, Go. In between, MyURY was replaced by (read: renamed to) MyRadio, which feeds it everything - scheduling, podcasts, team info, you name it.

There's still no guest book.