Everyone is an expert when things go wrong

As a technologist, it’s not a lot of fun having your platform or product in the media for all the wrong reasons.  From nationwide platform failures to widespread privacy breaches, whether it’s your fault, someone else’s fault, or some strange combination of bad luck, that sinking feeling when you turn up to work realising you’re probably going to need a whole lot more coffee really sucks.

When things go wrong, everyone becomes an expert.  And by everyone, I mean everyone: inside or outside your organisation, your peers, the press, your partner, and even your parents.  As genuine as some of those comments might be, it’s hard to be appreciative when the world is falling down around you.

To the peanut gallery, it seems like everything you’re doing and have done is wrong.  And in a perverse kind of way, they’re right…  As recent events unfolded, I didn’t want to join in on that particular chorus, but I did want to share with you just some of the challenges the people who build and run these things actually face:

It’s always more complex than you might think, and usually less complex than you make it: People love a silver-bullet solution; that one magic fix that will cure all the world’s woes.  We often start out with something that looks quite elegant under most circumstances, but then the real world catches up with a litany of edge cases and conditions that start to distort your “perfect” solution.

Soon that solution becomes so convoluted that its sheer complexity becomes a risk in its own right, and that other “less perfect” solution doesn’t look so bad after all.  You need to be able to see the whole forest, despite all the trees, casting aside any assumptions you might have had.  A rare skill in any scenario, but especially if you’ve become emotionally invested or lack all the information.

Your customer models are wrong: I don’t care how much customer modelling, surveying, quantitative testing, or data analytics you’ve done, your model is wrong.  Customers always behave in ways you didn’t expect.  The reasons are wide and varied: your model averaged them out, the information you started with was wrong or out of date, or between creating the model and launching, your customers had simply moved on.

You have to take every customer behaviour model to heart, but digest each with an appropriate pinch of salt.  Even after taking everything into account, the real world of launch day can still be an eye-opening experience.

Your solution is wrong: If your customer model is wrong, then chances are the solution you’ve built to service those customers will be as well.  Moreover, there is always something you haven’t considered; the unknown unknowns, as it were.  Whether blinded by a missing piece of knowledge, a lack of experience, or some other cognitive bias, there are always going to be some flaws in your solution somewhere.

Of course, the question is where?  Team diversity is key here, as are experience and time, but that’s another topic in itself.

Your testing is wrong:  With testing, you can only test what you know, based on the information, model, and solution you have.  There are also limits on what you can test and how relevant those tests actually are.  Platform size, cost, time, and other constraints mean you generally have to be selective about what you can do, interpreting or extrapolating your results as best you can.

This applies especially to performance and reliability testing.  Even if your model was right (which it won’t be), it’s nearly impossible to accurately test the behaviour of a large-scale system under full load.  You’re probably going to test a subset of behaviours on a subset of the platform, over a relatively short period of time, which may not extend well to the real world.  Equally, failure testing is more often than not performed under controlled “sunny day” scenarios, using only the failure modes you knew about in the first place.

Which leads me to my last point:

Your understanding of the platform is wrong:  It’s perhaps an obvious statement that technology platforms generally don’t fail in ways you would expect.  After all, if you had expected it to fail in that way, then you would probably have designed a fix for it, right?  Either you’ve underestimated the probability of an event, disregarded it, or just don’t fully understand how some aspect of the solution works.

As systems age, they can also start to fail in all sorts of weird and wonderful ways.  Some of those failures will be predictable or, with luck, have already been experienced by someone else.  From time to time, though, you can still find yourself on the end of a phone call with your supplier, with nobody having any idea of what to do next.

Whatever the case, no one really knows what they don’t know.  You can’t know everything, and chances are, no matter how well you think you understand it, you probably don’t; at least in some aspect.

In Short…

Hopefully you can see that in the world of large, high-profile technology platforms, the odds of success are really stacked against you.  The business and technology risks are significant, with little room for error on anyone’s part.  When it comes to resiliency and performance, even some of the biggest names in the industry are not immune to public failure.  Experience and time are your only friends here.

With each new catastrophic failure to hit the press, there can be a strong temptation for many to speculate over the causes and what could have been done differently.  There is really only one conclusion you can reliably draw: catastrophic failure only happens because someone hasn’t identified or managed all the risks.  For a failure to hit the press, the risks at play and their consequences must have been missed on multiple levels.

Whatever you do, don’t berate the teams behind it.  In hindsight, everyone thinks they’re a star quarterback.  I also have no doubt that behind each event, that very same team of guys and gals is working tirelessly to restore your service.  Most will be emotionally invested in its success.

And finally, if in the heat of the moment you’re assuming anyone would be immune from the same problems, then well…  you’re probably wrong.

Perhaps as technologists, it would save us a lot of time if we all assumed we were wrong at the outset and acted accordingly.  Our technology products and services might then spend a little more time in the papers for good reasons, rather than bad ones.

Image of Spruce Goose copyright Eric Salard