Hhhm, yes, but, umm, no

In BA’s case, the UPS in question delivers power through the mains, diesel and batteries.

On Saturday morning, shortly after 8.30am, power to Boadicea House through its UPS was shut down – the reasons for which are not yet known.

Under normal circumstances, power would have been returned to the servers in Boadicea House slowly, allowing the airline’s other Heathrow data centre, at Comet House, to take up some of the slack.

But, on Saturday morning, just minutes after the UPS went down, power was resumed in what one source described as “uncontrolled fashion.” “It should have been gradual,” the source went on.

This caused “catastrophic physical damage” to BA’s servers, which contain everything from customer and crew information to operational details and flight paths. No data, however, is understood to have been lost or compromised as a result of the incident.

BA’s technology team spent the weekend rebuilding the servers, allowing the airline to return to normal operations as of today.

Umm, they didn’t have a 100% mirrored, entirely redundant backup system on another site, on another power supply?

They didn’t?

Oh….

33 comments on “Hhhm, yes, but, umm, no”

  1. “power was resumed in what one source described as ‘uncontrolled fashion’”

    Some twat of a manager shouting down the phone and overriding people with actual knowledge, no doubt.

  2. Yep, remember working for one like that. He was much like I’d expect Gordon Brown to be.

    He *helped* during major problems by shouting things like “this needs fixing quickly!” at people who were already busy fixing, then standing immediately behind them breathing heavily.

    Also the many pointless demands for updates, which drag said fixers away from the things they are fixing to talk about fixing instead of doing it.

  3. Some twat of a manager shouting down the phone and overriding people with actual knowledge, no doubt.

    “Before I destroy the data centre, I want that order in writing. On paper. Signed.”

  4. Do you remember that ad – I think it was for IBM – where the chairwoman calls a board meeting, and sets out the problem with everything down. From memory it went: “The software guys say it’s a hardware problem; the hardware guys say it’s the software. Whose responsibility is it to sort this out?” And the chap next to her leans over and quietly says: “That would be you”.

  5. Mains charges the batteries, mains go out, batteries take over until the diesel can be fired up, then we discover some fuckwit insisted his aircon was on the uninterruptible supply…
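
The power chain comment 5 describes – mains charging the batteries, batteries bridging the gap until the diesel is up – can be sketched as a toy source-selection routine. Everything here (function names, numbers) is illustrative, not BA’s actual configuration:

```javascript
// Illustrative sketch of the power chain above: mains charges the
// batteries; if mains fails, the batteries carry the load until the
// diesel generator is running. Names and thresholds are invented.
function selectSource(state) {
  // state: { mainsOk, generatorRunning, batteryCharge } (charge 0..1)
  if (state.mainsOk) return "mains";
  if (state.generatorRunning) return "diesel";
  if (state.batteryCharge > 0) return "battery";
  return "none"; // batteries flat before the diesel started: lights out
}

// The failure mode joked about: non-essential load (the aircon) on the
// protected supply drains the batteries faster than planned.
function minutesOfBattery(capacityKWh, loadKW) {
  return loadKW > 0 ? (capacityKWh / loadKW) * 60 : Infinity;
}
```

With the aircon doubling the protected load, battery runtime halves – which is exactly when you find out how long the diesels really take to start.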

  6. Some twat of a manager shouting down the phone and overriding people with actual knowledge, no doubt.

    All the people with actual knowledge took redundancy months if not years ago. The trouble with failures that only occur every few years is that you need staff memories that are longer than that.

    I’ve done enough ISO 22301 audits to know that you can usually find single points of failure, even in the best-run organisations. But this does seem to be at least two unrelated failures of Business Continuity – first the UPS, then the failover (and then the recovery).

    #EPICFAIL

  7. My office had a total loss of power due to a flooded substation. Those of us in one part of it were oblivious, though: it turned out we were sitting in the space where the emergency IT support staff used to sit, and all the emergency power was diverted to us (and the server room, of course), so we could continue solitaire uninterrupted until it came back on.

  8. @Chris Miller

    Following your comment about folks taking redundancy and buggering off, I googled “British Airways IT outsourcing”.

    Might not be the direct cause. Might be a contributing factor (morale, insecurity, the experienced hands bugger off before having to go through the HR ritual humiliation of training their replacements).

  9. I’m not posting very well, sorry. The quote should be:

    But significant criticism has been leveled at the airline due to the fact that in recent years, UK-based IT staff have been laid off and replaced with staff from Tata Consultancy Services, a major Indian outsourcing company.

    In 2016, British Airways slashed 700 jobs in the UK. According to Mick Rix, a representative of trade union GMB, five of these worked in the equipment and facilities team at the facility that experienced the power surge.

  10. Cynic>

    If you want to live up to your name, you might like to think about why the far-righters at the GMB like to pretend outsourcing is only (or even predominantly) to India. It isn’t, but that story appeals to racists. In fact, Tata will have been employing people in this country, as well as some in India, to meet this contract.

    The reality is that it’s not insourcing or outsourcing that matters, but competent management. If you have it, you can control the quality of both internal and external workers. If you don’t, you can’t control either.

    Bigger picture here is that we don’t yet know what went wrong and whether BA were just plain unlucky, or it was a combination of a small error and bad luck, or they absolutely blundered.

  11. Was about to read that, then saw it was by Dave so skipped it.

    Can anyone else be bothered?

  12. Having a 100% redundant mirror for ALL your IT systems gets bloody expensive bloody quickly. Not having a 100% redundant mirror for critical systems – the ones that generate you revenue and keep aeroplanes in the sky instead of on the tarmac – is thoroughly irresponsible.

    In this day and age, having your own datacentre is widely regarded as a bad idea. Everyone else has caught on to the idea that division and specialisation of labour is spectacularly important here – especially seeing as they can help you answer the redundant system questions.

  13. I suppose it’s one of those things that some higher-ups don’t understand the value of until it goes wrong. ‘Cos right up till you need it, well-rehearsed business continuity is just money thrown away.

    (That said, having a good DR and recovery in place should make the insurance cheaper)

    Another thing that can happen, much like maintenance work such as patching, is that it gets skipped over because it isn’t showy. There is a bias in damagement toward doing “profile raising” work as a priority.

  14. Cynic, Dave has a point. A company outsourced IT to Tata. My ex-wife, who is Welsh and lives in Wales, now works for Tata. Disaster recovery is not one of the functions that usually relocates to India, for example. So the GMB complaint sounds fishy to me.

  15. Was about to read that, then saw it was by Dave so skipped it.

    Can anyone else be bothered?

    I wrote a Chrome extension that automatically strips out the worst of the windowlickers. I’ve pasted it up here before, can make it available again if anyone’s interested.

  16. Bit of a guess, but given the talk of power surges and outages:

    I suspect the fault was within the UPS itself: for some reason the power-monitoring kit decided to switch to battery and fire up the diesels, but the resulting supply was out of phase for the load. Various bits of kit start falling over or dying, so when the supply gets back to normal the load is still unbalanced, and more gear fails to restart correctly, or to stay up for long.

    Assuming they resolve that, they’ve probably got issues getting kit to re-sync and roll forward from the last coherent checkpoint, depending on when that was. At some point, once they’ve sorted the hardware failures, they just have to halt operations and wait until every damn thing is in a consistent state before beginning operations again.

    Seems possible, depending, that this type of failure wouldn’t necessarily fail-over to any off-site mirrors anyway.
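
The “roll forward from the last coherent checkpoint” step the commenter describes can be illustrated with a toy recovery routine: restore the checkpointed state, then replay only the log entries recorded after that checkpoint. A generic sketch, not BA’s actual process:

```javascript
// Toy checkpoint-and-replay recovery. The checkpoint captures state as
// of sequence number `seq`; any log entry at or before that point is
// already reflected in the checkpoint, so only newer entries are
// rolled forward. All field names here are invented for illustration.
function recover(checkpoint, log) {
  // checkpoint: { seq, state }   log: [{ seq, key, value }, ...]
  const state = { ...checkpoint.state };
  for (const entry of log) {
    if (entry.seq <= checkpoint.seq) continue; // already in the checkpoint
    state[entry.key] = entry.value;            // roll the change forward
  }
  return state;
}
```

The pain the commenter alludes to is that if kit died mid-write, the last *coherent* checkpoint may be a lot older than the last checkpoint, and everything replayed since has to be verified too.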

  17. “the far-righters at GMB”

    Ah Dave – when Ken Dodd goes you will be the last survivor of the Great British Clowns.

  18. Even if they did have a DR process, it’s a bit of a leap in the dark to test. A true test has to be done on the live system and that’s squeaky bum time for managers when the failover might itself fail. That seems to happen often enough to be a worry. That’s not an excuse not to do it but it takes a manager with large brass ones to push the button.

  19. Watch out for the website above – it seems a bit unstable.

    Indian software engineering perhaps?

  20. @Dio

    The GMB complaint sounded fishy to me, too. But it did refer to job losses in the affected function, so was relevant, though qualified.

    “Being from a union, I can’t say this is the most reliable source, but:”

    I assume the response to that was something along the lines of cut-n-paste “WAYCIST” trolling, so I didn’t bother reading it, as I’ll have read the same thing too many times already. Judging by the responses from the folks that did read it, I was correct to do so.

    (I come here to be educated and entertained, so I’ve started skipping over the obvious poor trolling crap like Newmania and Dave.)

  21. @Dio

    You know what, my comment wasn’t even about India. It was about outsourcing and the effect on morale.

    The outsourcing just happened to be to India.

    I’d’ve copy-pasted the quote the same if it had been outsourced to Capita.

    Interesting what bits people pick up on most.

  22. It’s what outsourcing does to the chain of management. If there’s an internal crisis, you call for all hands on deck to fix it. If the crisis is at a supplier, you’re entirely dependent on them to fix it.

    Also, as a supplier they’ll have less knowledge of your business processes. An internal IT team can say to management “if you don’t mind losing the baggage system for another 24 hours, we can get planes back in the air now”; whereas an external IT team can only say if it’s working or not working. This depends on the precise nature of the working relationship: some companies are far more closely integrated with their suppliers.

    Ideally you want to have something of a revolving door of staff between your IT people and the suppliers, just to keep that kind of knowledge alive.

  23. 1) Think very carefully about adding browser extensions from some random bloke on the internet. Study the code before installing it, or get someone to explain it to you.

    2) Create a directory somewhere, put this in a file called manifest.json: https://filetea.me/n3wD056BHnuTZxKrpkW5G2ZUQ

    3) In the same directory, put this in a file called arsewipe.js: https://filetea.me/n3wqOuCrxxdSPyTEJc4PSz55Q

    4) Go to chrome://extensions/, click “Load unpacked extension” (you may need to enable developer mode first for this to show up), and select the directory you created in step 2.

    5) When Chrome has installed the extension, reload any page on Tim’s blog to filter out the comments from people you just skip anyway.

    6) To customise the list, edit arsewipe.js in the obvious place, reload the extension on the extensions page, then reload the blog page.
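
The linked files aren’t reproduced here, but the filtering logic in a content script like arsewipe.js is presumably something along these lines – the selector and class names below are guesses about the blog’s markup, not the actual code:

```javascript
// Guess at the core of a comment-filtering content script: hide any
// comment whose author is on a blocklist. Edit BLOCKLIST to customise.
const BLOCKLIST = ["Dave", "Newmania"];

// Case-insensitive match of an author name against the blocklist.
function shouldHide(authorName, blocklist) {
  return blocklist.some(
    (name) => authorName.trim().toLowerCase() === name.toLowerCase()
  );
}

// Only touch the DOM when actually running in the browser; the
// selectors here are assumptions about how the blog marks up comments.
if (typeof document !== "undefined") {
  for (const comment of document.querySelectorAll("li.comment")) {
    const author = comment.querySelector(".comment-author");
    if (author && shouldHide(author.textContent, BLOCKLIST)) {
      comment.style.display = "none";
    }
  }
}
```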

  24. @AM

    Yep, with an outsourcer you can’t really complain if it’s still down but within SLA – they’ve got other customers to look after too.

    In-house, you can tell the guys “bollocks to that, get it up as quickly as you can. Ignore all other calls and tasks”.

  25. TCS from the corrupt third-world country India? Superb! BA, like in British Airways, outsourcing anything at all to India? What can go wrong? I mean with “great, smart, cheap” monkeys?

  26. In this day and age, having your own datacentre is widely regarded as a bad idea.

    Mostly by managers who get a fat bonus for ‘saving money’ and will be out the door before the whole thing crashes and burns.

  27. @Edward M. Grant: No, not really. By managers who realise that having a few in-house employees looking after some racks of servers in a building somewhere isn’t quite as safe, secure, reliable or anything else as a multi-billion dollar company who employ more people to restock the bullets in the security guards’ guns than they do to look after the datacentre.

    Having your own datacentre is a *terrible* idea. That needs to be left to the people who really, really know what they’re doing.

  28. Had a UPS bypass switch die on me one night (in the middle of an unrelated thunderstorm), informed management that everything was dead and rang outsourced Tech Support.

    TS heard the word “switch” and said we needed an electrician.

    I explained that the switch in question is a box of electronics and that the mains was fine; they then denied we had one and called a spark anyway.

    I worked out that I could probably get systems up and running on mains alone if I was allowed to run a flying lead to bypass the switch and UPS (the system wouldn’t be protected, but at least it would be running).

    It took 5 hours before TS got hold of an electrician who would let me plug something in.

    Every couple of weeks an electrician would turn up to fix ‘a broken switch’; nobody in TS knew what a UPS bypass switch was or what it was for (it allows you to hot-swap the UPS batteries without rebooting the whole system – eventually I had to send links to the manufacturer’s product page).

    It took 11 months to get a new one fitted. I got the engineer to open up the old one, and it was immediately obvious that a ribbon cable to the control panel had come loose; refitted the cable and it worked fine.
