Black Box thinking and outages

There’s a morbid fascination here at Exonia with aviation safety. During my MBA, I shared classes with a guy who worked for British Airways. His job was to monitor accident reports from around the world and to figure out what might need changing in processes or design. Stories ranging from the paint on the tail through to the slack in a jacket cuff* were a feature of many post-lecture drinks. One of my most watched shows used to be National Geographic’s “Air Crash Investigation”.

Actual accidents are reported in detail, and changes are then made across the industry so that the same accident should never happen again. Everything from simple changes, like making metric and imperial instruments different sizes so that an engineer can’t install the wrong one by mistake, through to far more fundamental requirements, such as those that led to the iconic Concorde coming out of service, is par for the course in aviation.

In telecoms, we don’t quite think like that. There’s a fantastic book out there called “Black Box Thinking” which makes a very similar case to what I suggest here. It has a relative on the bookshelves too; “The Checklist Manifesto” takes some of the same strands but in more specific detail and is also worth a read.

We may not think telecoms is quite as life or death as Captain Sully landing his A320 on the Hudson in 2009, but at any given moment in time, how many networks are connecting those in dire need with the Emergency Services? Various parts of networks are designated as Critical National Infrastructure and the large telcos are involved in civil contingency planning and play an important part in facilitating the co-ordination of emergency organisations. Network outages are a nuisance; not being able to watch Netflix is infuriating for many but not being able to call in the cavalry when there’s an axe-murderer at your door would be distinctly more annoying.

That said, communications providers in the UK are well incentivised to keep their networks running. There’s an automatic compensation regime for loss of service or failed installations, an absolute obligation for uninterrupted access to the emergency services and hefty fines for failing that. All of that is before the reputation damage that an incident can cause; TalkTalk’s shareholders are very acutely aware of how that translates financially.

There are two cases from my own experience that I would like to use to illustrate my point.

The first is common to any organisation: commercial pressure, the drive to cut corners to deliver a quicker or lower-cost service. I recall the CEO of one of my clients asking me why I was being blamed for an apparent delay in the move of the Samaritans 116123 number from another network to ours, a flagship project at the time. Well, it was quite simple. What would happen to calls in progress when the switch was made? When you’re talking about potentially suicidal members of the public, you don’t want to unilaterally cut them off by accident (or give them number unavailable). Running through every single possible iteration of what might or might not happen, and how it might be mitigated in the moment, took time across two networks’ technical teams. Of course, the CEO instantly realised the imperative of this work and backed it 100%; the project was a complete success too. That said, the moral hazard, which thankfully was nowhere near being realised, should be clear to see.

The level of technical engagement between competitors to make something happen is always a surprising feature of telecommunications. Thinking back to the MBA, I am reminded of the word “co-opetition”. A look at the cases in the courts will demonstrate ever-changing factions of co-operation within the industry. Network standards are often defined by the consensus of engineers after lengthy debate across providers, and Ofcom has on more than one occasion reached a voluntary understanding with a group of networks.

That is something the industry should be proud of, and long may it continue.

The second story, though, presents a different side. One summer, the incoming power feed to one of a client’s primary data centres was interrupted. Well, that happens. Everything started to go to plan; the air conditioning cut out to reduce the power load, the batteries took over and the generators began their start-up sequence. Up until that point, it looked like a real-life version of the regular disaster rehearsals.

But the generators kept resetting. All of them. They wouldn’t spin up to full power and were stuck in a loop. Despite the design, and the results of prior tests, saying otherwise, all the air conditioning was trying to come back on as soon as there was any incoming power from the generators. That sent the demand through the roof and beyond the generators’ capability at that moment, so they reset, because that’s what they were programmed to do in that situation. As a result of the air conditioning’s antics, the batteries were discharged far sooner than intended and the site went dark 13 minutes after the incident started, before the issue could be fully diagnosed.
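For the technically minded, that reset loop can be sketched as a toy model in a few lines of Python. To be clear, this is an illustration under my own assumptions, not a description of the real site: the function name, the kilowatt figures and the one-minute time step are all made up, and only the 13 minutes of battery comes from the story.

```python
def simulate(capacity_kw, ramp_kw, aircon_kw, aircon_waits, battery_minutes):
    """Toy model of a generator spinning up while the batteries hold the site.

    All names and numbers are hypothetical. Returns a tuple of
    (outcome, minutes elapsed, number of generator resets).
    """
    output_kw = 0.0
    resets = 0
    for minute in range(1, battery_minutes + 1):
        # The generator ramps up a little each minute, up to its rated capacity.
        output_kw = min(capacity_kw, output_kw + ramp_kw)
        at_full_power = output_kw >= capacity_kw

        # Intended design: the air conditioning waits for full power.
        # Faulty behaviour: it grabs power as soon as any is available.
        aircon_on = at_full_power if aircon_waits else True

        if aircon_on and aircon_kw > output_kw:
            # Demand exceeds what the generator can supply right now,
            # so it protects itself by resetting and starting again.
            output_kw = 0.0
            resets += 1
            continue

        if at_full_power:
            return ("generators online", minute, resets)

    # The batteries ran out before the generator ever reached full power.
    return ("site dark", battery_minutes, resets)


# Assumed figures: 1,000 kW generator, 250 kW ramp per minute, 400 kW of air conditioning.
intended = simulate(1000, 250, 400, aircon_waits=True, battery_minutes=13)
faulty = simulate(1000, 250, 400, aircon_waits=False, battery_minutes=13)
print(intended)
print(faulty)
```

With these made-up figures, the intended sequence has the generators online within a few minutes, while the faulty one resets every minute until the batteries are gone. The only difference between the two runs is when the air conditioning is allowed to come back on.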

Now, the vast majority of services failed over as planned to an alternative data centre; many did so seamlessly, including calls in progress. In fact, the average person on the street would have been completely unaware anything had happened. In aviation terms, it was very much a near miss that only the tower was aware of.

What then happened was a detailed internal investigation, statutory reporting to the regulator and RFOs (reason for outage reports) sent to customers. The batteries were replaced and the old ones sent to the lab for analysis, and many changes to processes (like checking air conditioning start-up sequences after all works on them and periodically thereafter) were implemented. That same situation shouldn’t ever occur in that network again.

So far, so good. The author of Black Box Thinking would be proud.

Except that’s largely where the story ends. Yes, I am sure engineers regale each other with the tale over a pint of real ale (apologies for the stereotype) and the knowledge is partially disseminated. Invariably, the representatives of that network working on industry standards would draw on that experience in their drafting and of course, if they move to another employer, they take it with them. If the regulator had launched a formal investigation, then eventually there would’ve been a report from which you might be able to infer some specifics, if anything useful was left unredacted.

If it had been a near miss involving your British Airways flight, though, there would’ve been bulletins sent around the world. A simple Google search will find them all; Transport Canada’s publications are one example.

I wonder how many other outages there have been like the one I outlined. Would it have happened if there had been a global clearing house that anonymised reports? (Commercial pressure is undoubtedly one of the reasons why telecommunications has a different attitude to this way of life compared to aviation.) Would the industry be better if the regulator, when opening an investigation, as they did today, gave some clue as to what happened?

For an industry that is so good at building things together, I remain surprised that there appears to be room for improvement in how we go about stopping things breaking. The economist in me suspects that the perceived cost of reputation damage from exposing one’s dirty laundry to scrutiny exceeds the benefit of mitigating potential future outages; psychologically, that is understandable when we tend to place greater emphasis on the side of the equation that is rooted in fear. Also, reliability is a competitive differentiator in an increasingly cut-throat market; it is equally understandable that there is reticence in sharing such information.

In an idle period, I may well work up something more substantive on this subject. Do feel free to get in touch if you think I’m onto something!

* If I recall the story correctly, a go-around had to be done on one aircraft because the pilot’s jacket sleeve caught one of the four throttle levers as he reached for another switch.

Published by Peter Farmer on September 18, 2018 January 11, 2023
