Is going wrong a rite of passage in technology?
It’s not often you want to air your mistakes or errors in public. An HBO intern recently sent a test e-mail to the production list, and Fastly allowed a customer configuration change to disrupt service. Speak to anyone who’s worked with production technology systems for any length of time and you’ll hear of at least one major clanger.
In my own career I’ve seen the dd command destroy a VM’s disk image rather than take a backup, a major campus network knocked out by a simple memory allocation change, and a mis-wired closing contact that combined with numerous miscommunications to play an ad break over a Remembrance Sunday silence on the radio. These stories, plus the ones you’ll no doubt hear from colleagues or the trade press, can prompt the rather glib question “if you’re not breaking things, are you really working?”
The ideal, modern world of CI/CD suggests that we can de-risk those changes by making them smaller and more regular. We can also catch a lot of the more obvious or predictable errors before they happen. Completely breaking a production environment shouldn’t be possible.
Except it is. Much in the way there’s always a “better idiot”, there’s always a way to get through a barrage of reviews, tests and even change boards. The disaster that looks obvious in hindsight arrived through the Swiss cheese model: every layer of defence had a hole, and the holes lined up. No-blame reviews are supposed to prevent these problems ever happening again, though it would be a first to see a truly “no blame” environment.
Things get even murkier with that legacy system built on old practices and a “that’ll do” attitude, or with underlying infrastructure that has slowly evolved over time. While it is almost always technically possible to have a production-like test environment, the economics might not allow for it. Using network equipment as an example, whitebox switching and routing platforms can be run up in a virtual environment and configuration changes tested within it.
However, that £1m+, 10-year-old core switch is a trickier prospect. Even with the modern whitebox / SDN solution, the virtualised switch may not operate identically to the hardware ASICs. It’s less scary now, but it’s not impossible to run into edge cases you “should have known about” before making the change. That unexpected null pointer exception will still crash the product.
Whether in development or operations, encountering such an incident changes behaviours. Yes, new practices are put in place to prevent it ever happening again, but organisations can become more risk-averse over time. Why would a change board approve something similar to a previous major incident? You can cite all sorts of preventative measures but still get rebuffed on a confidence basis. All this while management are chasing you for a looming deadline.
Inter-team confidence can also diminish. Who’s going to trust the operations team that took out the virtualisation environment? Who’s going to trust the development team that pushed a showstopper bug into production that woke everyone up at 3am?
And that’s the crux of it: people are going to make mistakes. We can mitigate them as best we can and prevent them happening in the future but we’re human. Computer systems are here to support human endeavours. This includes all the complexity, bending of rules and someone having a bad day at the office (virtual or in person) that come with the territory.
To return to the opening question, it’s less a rite of passage and more of a learning exercise. It can and will happen to anyone who’s around long enough. It will definitely knock confidence in both them and the organisation. Much like HBO’s scenario, the organisation has to take it on the chin and keep going.
Much like avoiding walking down a specific path after a dog bite, putting off similar changes only generates more pain down the line. It could almost be considered a form of (non-?)technical debt. The balance between recklessness and reclusiveness can be found, but we’ll need to make a few mistakes along the way. As someone who’s made more than their fair share… I’ll be the first to admit it’s not going to be easy.
On that note, what mistakes have you made, or clangers have you dropped, along the way in your career?