Reliability under abnormal conditions

Preparing Systems for the ‘100-year wave’

Keeping complex distributed systems available to service customer requests under peak load is hard. The challenge is exacerbated by several factors: a growing number of services, servers and external integrations combined with a rapid pace of new feature delivery; heavy spikes in load during annual peak periods; and traffic anomalies driven by promotions and external events. Luckily, there are strategies that help you keep serving your customers and generating revenue by limiting the impact of problems, even if it is not feasible to reduce the risk to zero.

Here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers … the majority of your questions trend towards the unknown-unknown. Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll-back (via automated canaries, gradual rollouts, feature flags, etc). — Charity Majors

Breaking down the problem

The two major dimensions to address are: preventing as many issues from arising as possible; and then limiting the impact of issues that do arise. Prevention is often described as increasing mean time between failures (MTBF) and mitigation is decreasing mean time to recovery (MTTR), though time may not be as important a measure as impact on revenue or customer experience — more on that later.
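To make the MTBF/MTTR framing concrete, here is a minimal sketch of the standard steady-state availability calculation. It assumes MTBF means the average uptime between failures and MTTR the average time to restore service; the function names are illustrative, not taken from any particular tool.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: uptime / (uptime + repair time)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in a year at the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    baseline = availability(100, 2)
    fewer_failures = availability(200, 2)   # prevention: double the MTBF
    faster_recovery = availability(100, 1)  # mitigation: halve the MTTR
    print(baseline, fewer_failures, faster_recovery)
```

Under this simple model, doubling MTBF and halving MTTR yield the same availability, which is one reason to weigh the two by cost rather than treating prevention as the only lever.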

For both prevention and mitigation, there are cost/benefit trade-offs. Cost is measured not just in dollars, but also in delays to shipping new features, which is an opportunity cost. Ultimately, every organization needs to make its own judgement about the service level it’s willing to commit to, given the cost implications of achieving that service level. Even so, most organizations will strive to continuously lower the cost of supporting their desired service level. This article explores the various strategies and techniques for doing that.


Most prevention techniques involve testing the system, or parts of it, before releasing to production. Major categories to cover include: testing for functional correctness; ability to perform under expected load; and resilience to foreseeable failures.


Mitigation involves limiting the breadth of impact, mainly through architectural patterns of isolation and graceful degradation of service, and limiting the duration of impact by improving time to notice, time to diagnose and time to push a fix.
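One common way to combine isolation with graceful degradation is a circuit breaker with a fallback. The sketch below is an illustration under assumed names and thresholds (CircuitBreaker, max_failures, reset_after_s are all hypothetical), not a description of any specific library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency for a while and serves a fallback."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Circuit is open: degrade immediately, do not touch the dependency.
                return fallback()
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

A caller might wrap a non-essential feature, such as recommendations, so that a failing dependency produces an empty list rather than an error page.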


There are also some hybrid strategies that straddle prevention and mitigation. Canary releasing to a subset of users is a type of prevention strategy, but performed in production with the impact heavily mitigated. Likewise, the advanced technique of Chaos Engineering tests and exercises prevention and mitigation measures in a production environment.
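As a rough illustration of canary releasing to a subset of users, the sketch below assigns users to the canary by hashing their ID into a stable bucket. The function name, the salt value and the bucket granularity are assumptions made for this example.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "release-1") -> bool:
    """Return True if this user falls inside the canary percentage.

    Hashing (salt + user_id) keeps the assignment stable across requests,
    and changing the salt per release reshuffles who gets the canary.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0.01% granularity
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(in_canary(u, percent=5) for u in users)
    print(f"{exposed} of {len(users)} users would see the canary")
```

Raising the percentage gradually widens exposure without moving users who were already in the canary out of it, since their buckets do not change.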

The following diagram outlines the major categories:
[Figure: overview mindmap]


Don’t push requirements – pull information

I always struggled to see how what Lean teaches us about pull systems can be applied to software development processes. That was until I had an “Aha!” moment a little while ago helping a client apply lean and agile principles to their delivery process.

The big fat lie

I understand how queuing theory can help identify and reduce bottlenecks in processes and have used finger charts and kanban boards to do this for a while, but I still find calling this a “pull system” to be slightly disingenuous. All that’s happening is that more “stuff” is being pushed based on a trigger when certain buckets get too low. This reminds me of my annoyance with early technologies on the web that were touted as being “push” but were really just “repetitive-pull” (but not in a good way). I’ve never seen a software organization where the developers have said to the business or product people “we’ve got nothing to do, can you think up some new projects or features for us please?”


Build Transformation across an Organization

My most recent project was helping a major online retailer to mature their build process as part of a wider effort to improve their IT effectiveness through the injection of development best practices.

When we came on board, manual intervention was needed for any of their builds or deployments to work, so it was rare for more than a couple of builds or deployments to complete successfully in a day. Now we often have up to 1,000 builds running every day and, what’s more, the majority of them pass!

This article looks at a few of the techniques we’ve had to put in place to enable this transformation and what we’ve learnt along the way.


Can virtualization save the real world?

With Google measuring the efficiency of their code by the gigawatts required to serve it to millions of people, optimizing applications can actually have a positive impact on the world.

Logicalis have put together some advice on how to reduce the impact of IT on the environment. The suggestions range from reducing hardware requirements through virtualization and other consolidation techniques to old favorites like double-sided printing, video-conferencing, electronic forms and turning off your desktop at night.


Ant Fu

We’ve had some discussions recently about best practices when creating Ant scripts, so I thought I’d write up a few of my favourites.

Managing Ant target dependencies

“depends” are great, until your build file gets bigger than a couple of screenfuls. You can end up with a crazy spaghetti monster of dependencies very quickly. On a few builds I’ve worked on we’ve had a basic rule:

Targets can either have depends or a body, but not both.
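A rule like this is easy to check automatically. The following is a minimal sketch, assuming a standard Ant build file in which each target's tasks are child elements of its target element; the function name and the sample build file are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def targets_breaking_rule(build_xml: str) -> List[str]:
    """Return names of targets that have both a depends attribute and a body."""
    project = ET.fromstring(build_xml)
    offenders = []
    for target in project.iter("target"):
        has_depends = bool(target.get("depends", "").strip())
        has_body = len(target) > 0  # any child element counts as a body
        if has_depends and has_body:
            offenders.append(target.get("name", "<unnamed>"))
    return offenders


SAMPLE_BUILD = """
<project name="demo" default="all">
    <target name="all" depends="compile,test"/>
    <target name="compile">
        <javac srcdir="src" destdir="build"/>
    </target>
    <target name="test" depends="compile">
        <junit/>
    </target>
</project>
"""

if __name__ == "__main__":
    print(targets_breaking_rule(SAMPLE_BUILD))
```

In the sample, "all" only orchestrates and "compile" only does work, while "test" mixes the two and would be flagged.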


Show don’t tell: Consulting with GraphViz

I’ve often found that it’s much more effective to show clients what their problems are, rather than just telling them. Recently I’ve ended up using GraphViz as a great tool for highlighting complexity that needs to be addressed.

At the client I’m currently working for, the complexity of the build scripts was getting out of hand. I wanted to goad the customer into prioritising some simplification work, so I turned to GraphViz to depict how complex the build was. The build we’re using is a large, centralised Ant script that builds about 10 different applications. It manages everything through the process of compile, test, package and deploy.

I found the handy ant2dot.xsl tool that uses XSL to transform an Ant build file into a DOT format graph representing the flow and dependencies between the various build targets.
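For readers who want to see the idea without the XSL, here is a rough Python stand-in for the same kind of transformation. It is not ant2dot.xsl itself, only a sketch that reads the depends attributes of an Ant build file and emits a DOT graph; the function name and sample input are assumptions.

```python
import xml.etree.ElementTree as ET


def ant_to_dot(build_xml: str) -> str:
    """Render Ant target dependencies as a DOT digraph (target -> dependency)."""
    project = ET.fromstring(build_xml)
    lines = ["digraph build {"]
    for target in project.iter("target"):
        name = target.get("name", "")
        lines.append(f'    "{name}";')
        for dep in target.get("depends", "").split(","):
            dep = dep.strip()
            if dep:
                lines.append(f'    "{name}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = """
    <project name="demo" default="deploy">
        <target name="compile"/>
        <target name="test" depends="compile"/>
        <target name="package" depends="compile, test"/>
        <target name="deploy" depends="package"/>
    </project>
    """
    print(ant_to_dot(sample))
```

The output can be saved to a file and rendered with the GraphViz dot command, for example `dot -Tpng build.dot -o build.png`.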
