The risk of a crash or collision suddenly becomes very real. It is not good enough to have a backup system that cannot be operated for fear it might become corrupted by the same virus that has hit the frontline operating system. Safe flying demands that the backup system is discrete and capable of taking over the full load of flight information instantly.
If you hand over the information then you’re sending the same virus through, aren’t you? And if you have a duplicate system that you’ve fed, in parallel, with the same information in real time then it’s got the virus too, right?
Somehow I suspect that if Hutton was talking about France only, his opinion of the country’d be much lower.
The idiot with enthusiastic tone
Who praises every century but this,
and every country but his own.
It wasn’t a virus, it was an anomaly. If the NATS system shut down, then the safety measures worked: they prevented the whole system from being corrupted and an unseen fault from leading to calamity.
The ATC backup system is pencil and paper – as it was before computers. The only secure backup for an electronic system is a manual one. No planes crashed, and air traffic still moved, albeit more slowly.
There is no such thing as perfect.
So the great stakeholder is now an expert on how to design distributed real time systems? The incredible extent of middle class hubris never ceases to amaze me.
Martin is right, it’s a great example of the hubris of the useless class. Most of them have never had a real job and have no record of success in anything except social climbing, yet they presume to tell the rest of us how everything should be done.
The gold standard in resilient systems is to have your backup system be different from the primary. I.e. running different software or a different architecture (ARM vs. Intel say).
Then a virus that takes down the primary may not affect the backup.
But that is really expensive to duplicate for such a rare occurrence. Mostly it is just a copy of the primary as that copes with common problems like power failure or flooding etc.
The gold standard here was to switch from computer to pen and paper. It worked…..
It’s not a virus. It’s about duff data. And software should handle it but clearly it caused some sort of crash. Happens.
And the backup won’t work, because I presume the data is being replicated to it. Backups aren’t for bad data but for hardware or comms failures.
And really, NATS have had one failure in a decade. That’s a pretty robust system. A fuckwit like Willie, who couldn’t run a tiny business training company, has no idea how good that is. Crumbling my arse.
During my time in the communications industry our ‘internal’ customers demanded failproof backup systems. I would ask them if they had backup plans in place to cover building fires or gas leaks. There was no answer. It’s always easy to criticise someone else when you know nothing about the technicalities of their jobs.
Reminds me of people thinking they had backup connections to fail over to, not realising that the first external infrastructure point they reached was the same – so all they did was move the point of failure from inside the building to outside it.
As for crumbling schools, this was a known problem in the 90s when Blair was spunking money on various nonsense. Could have sorted the concrete problem instead of the Olympics, NPfIT, HS1.
Liberal on input, conservative on output.
Hmmm.
Maybe that’s past its use-by date.
Conservative on input, conservative on output.
Ah, but screams of BIGOTS!!! Everything must be liberal! And ENFORCED as liberal!
@Tim Worstall – “The gold standard here was to switch from computer to pen and paper. It worked…..”
It clearly didn’t. There was major disruption to flights.
And there was no suggestion of a virus being involved – that’s just a journalist’s wild imaginings.
It’s possible to have a system whereby there are three independent implementations and a voting system to decide on the correct result, but that is extremely expensive. Since the implementations must be done by totally different teams, that means you must pay for at least the cheapest, second cheapest, and third cheapest alternative bids. And there’s no guarantee that the flaw that eventually surfaces is in an implementation and not in the specification.
It is foolish to attempt to avoid all failure – just keep it down to a reasonable level.
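To make Charles’s idea concrete, here is a minimal sketch of a 2-out-of-3 voter over independently written implementations – hypothetical names, Python, and certainly not how NATS actually does it:

from collections import Counter

def run_channel(impl, msg):
    # A channel that crashes or rejects the input simply casts no vote.
    try:
        return impl(msg)
    except Exception:
        return None

def vote(implementations, msg):
    # Majority voting over independently coded implementations of the same spec
    # (results assumed hashable for this sketch).
    results = [run_channel(impl, msg) for impl in implementations]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        raise RuntimeError("all channels failed – reject the input, fall back to manual")
    answer, votes = counts.most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority – reject the input rather than guess")
    return answer

# Hypothetical usage with three separately written flight-plan parsers:
# plan = vote([parser_a, parser_b, parser_c], raw_message)

As Charles says, a duff message that crashes one channel gets outvoted, but a flaw in the shared specification still takes all three down together.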
“It clearly didn’t. There was major disruption to flights.”
That’s how the (manual, backup) system is designed to work. If we have to fall back to manual every 10 years or so, that may well be a sensible trade-off, rather than build extremely expensive solutions to take availability from 99.97% to 99.99%.
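For scale, rough arithmetic: 99.97% availability allows about 0.0003 × 8,760 ≈ 2.6 hours of downtime a year, while 99.99% allows only about 0.0001 × 8,760 ≈ 53 minutes – and that last couple of hours is exactly the expensive bit.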
Charles: majority voting systems work at small scale, like fly-by-wire systems on aircraft. But not at the scale of an ATC system. As others have said, one failure of that scale in 10 years is pretty good going. The main problem was that it couldn’t be rectified before the domino effect spread it far and wide.
@ Charles
It is reported (without any dissent that I have observed) that the problem was due to a data input. How can you forecast that this data input would only affect one system AND that a flight plan that led to a collision would be rejected by a *minimum* of two out of three systems?
Your implied assumptions are just that – assumptions.
It is *possible* that running three different computer systems simultaneously would reduce the frequency of such shut-downs, but that is “not proven”.
One of the more ironic war stories from my experience was a customer who had both production and development deployments of our product and contacted us about problems on their development server. Ultimately it emerged that the development server hosted a single service instance in which they had defined three copies of their production image, one for each of their three development teams, kept logically separate through systematic object renaming.
When, in addition to upgrading them to a more robust and performant version, we explained that there was an (admittedly poorly documented) ability to license and run multiple service instances on their development server, they said that they preferred to carry on the same way because having their candidate deployment image running under triple load would really set their minds at ease about promoting it from development to production.
“It clearly didn’t. There was major disruption to flights.”
The top priority for air traffic control is not killing people. Disruption is unfortunate, but not lethal. I’m guessing that the faulty message related to a flight in the air. Telling them to park up for a few hours isn’t an option.
‘The top priority for air traffic control is not killing people. Disruption is unfortunate, but not lethal.’
Very true, decine. Of course, this doesn’t mean I wouldn’t bitch and moan anyway if I was delayed. But I’m like that.
One of the few organisations that had both the money to investigate multiple-independent-solutions-and-vote and the obligation to publish was NASA (was, not now).
Their papers, from around the 1980s, can be found in the professional aviation publications.
In a nutshell: it’s a waste of money. Most bugs are caused by incorrect or imprecise specifications and similar common mode failures. Independently coded systems just increase the cost, not the safety.
There’s a tale, probably apocryphal, about a software-controlled torpedo, which the Navy insisted must self-destruct if it had turned more than 180 degrees after launch (to avoid seeking the vessel wot fired it). Come trials day, fire tube 1! Oh dear, no launch. Return to base to study the expensive prototype. So, U-turn back to base. Boom! The torpedo had been told it had been launched, and the software did exactly what the spec said it should do.
The training example I use is to ask people to pseudo-code up a proggie to solve quadratics. You know the one. If the team is any good, most of them will have coped with complex roots and won’t have square-rooted a negative number. b-squared minus 4ac, etc.
Then you give them a quadratic with a=0.
So class, whose program didn’t just divide by 0?
Perfectly valid quadratic – you were, of course, told at the start that a, b, c could be any value.
One or two may survive this, so try a=0, b=0, c=5.
But that’s not fair they say, your input equation is 5=0!
What, you had no input value validation checking?
Which would appear to be the case in the ATC failure (according to some reports).
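For reference, a sketch of how the quadratic exercise can handle those edge cases – my own illustration, not the class answer:

import cmath

def solve_quadratic(a, b, c):
    # Solve a*x^2 + b*x + c = 0 without assuming a != 0.
    if a == 0 and b == 0:
        # The "5 = 0" input: no x in the equation, so validate rather than divide by zero.
        raise ValueError("every x is a solution" if c == 0 else "no solutions")
    if a == 0:
        return [-c / b]                      # degenerate case: a linear equation
    root = cmath.sqrt(b * b - 4 * a * c)     # cmath copes with a negative discriminant
    return [(-b + root) / (2 * a), (-b - root) / (2 * a)]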
Being somewhat experienced in this field: the way to have a resilient system is to have another (a hot standby), and to be able very swiftly to restart from a known position and carry on while the failed side recovers, with or without repair. Service availability is then largely driven by the fault-detection coverage on the standby side and the speed of return to service of the failed side – NOT by either side’s reliability.
Eugene Kranz was right. Failure is not an Option. Failure is inevitable. So the professionals design for it.
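A toy illustration of that hot-standby pattern (hypothetical names and a hypothetical restore() on the standby; real ATC failover is of course far more involved):

import time

class HotStandbyPair:
    # The standby continuously checkpoints the primary's last known-good state.
    # On a missed heartbeat it restores that checkpoint and takes over; availability
    # hinges on detection coverage and switch-over speed, not on either side's MTBF.
    def __init__(self, primary, standby, heartbeat_timeout=2.0):
        self.primary, self.standby = primary, standby
        self.timeout = heartbeat_timeout
        self.last_heartbeat = time.monotonic()
        self.checkpoint = None

    def on_heartbeat(self, state_snapshot):
        # Called by the primary while it is healthy.
        self.last_heartbeat = time.monotonic()
        self.checkpoint = state_snapshot

    def active_node(self):
        # Called by clients; fails over if the primary has gone quiet.
        if time.monotonic() - self.last_heartbeat > self.timeout:
            self.standby.restore(self.checkpoint)   # restart from the known position
            return self.standby
        return self.primary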
“Time for bed,” said Zebedee.