Reliability under abnormal conditions

Preparing Systems for the ‘100-year wave’

Keeping complex distributed systems available to service customer requests under peak load is hard. The challenge is exacerbated by a number of factors: a growing number of services, servers and external integrations combined with a rapid pace of new feature delivery; heavy spikes in load during annual peak periods; and traffic anomalies driven by promotions and external events. Luckily, there are strategies that help you keep serving customers and generating revenue by limiting the impact of problems, even if it is not feasible to reduce the risk to zero.

Here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers … the majority of your questions trend towards the unknown-unknown. Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll-back (via automated canaries, gradual rollouts, feature flags, etc). — Charity Majors

Breaking down the problem

The two major dimensions to address are: preventing as many issues from arising as possible; and then limiting the impact of issues that do arise. Prevention is often described as increasing mean time between failures (MTBF) and mitigation is decreasing mean time to recovery (MTTR), though time may not be as important a measure as impact on revenue or customer experience — more on that later.
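To make the relationship between the two measures concrete, here is a minimal sketch using the standard steady-state approximation, availability = MTBF / (MTBF + MTTR). The function names and the example figures are illustrative assumptions, not measurements from any real system.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: one failure every 999 hours, one hour to recover.
    print(f"availability: {availability(999, 1):.4f}")
    # Halving MTTR has roughly the same effect as doubling MTBF.
    print(f"faster recovery: {availability(999, 0.5):.4f}")
    print(f"fewer failures:  {availability(1998, 1):.4f}")
```

The sketch only captures time. As noted above, the impact on revenue or customer experience may matter more than the raw duration of an outage.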

For both prevention and mitigation, there are cost/benefit trade-offs. Cost is measured not just in dollars, but also in the delays to push out new features — an opportunity cost. Ultimately, every organization needs to make its own judgement about the service level it’s willing to commit to, given the cost implications of achieving that service level. Even so, most organizations will strive to continuously lower the cost of supporting their desired service level. This article explores the various strategies and techniques for doing that.


Most prevention techniques involve testing the system, or parts of it, before releasing to production. Major categories to cover include: testing for functional correctness; ability to perform under expected load; and resilience to foreseeable failures.


Mitigation involves limiting the breadth of impact, mainly through architectural patterns of isolation and graceful degradation of service, and limiting the duration of impact by improving time to notice, time to diagnose and time to push a fix.
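One common way to combine isolation with graceful degradation is a circuit breaker that stops calling a failing dependency and serves a fallback instead. The following is a minimal sketch under stated assumptions: the class name, thresholds and fallback are hypothetical, and a production system would more likely use an established resilience library than hand-rolled code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and degrades to a fallback."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the dependency entirely and degrade gracefully.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller might wrap a recommendations lookup as breaker.call(fetch_recommendations, lambda: []), so that the page still renders without personalized content while the dependency is down. Both names there are hypothetical.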


There are also some hybrid strategies that straddle prevention and mitigation. Canary releasing to a subset of users is a type of prevention strategy, but performed in production with the impact heavily mitigated. Likewise, the advanced technique of Chaos Engineering is an approach for testing and practicing prevention and mitigation approaches in a production environment.
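As an illustration of canary releasing to a subset of users, the sketch below assigns each user to a stable bucket by hashing their identifier, so the same user consistently sees the same version while the rollout percentage is gradually raised. The function name, the salt value and the bucket count are assumptions made for this example.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "release-42") -> bool:
    """Return True if this user falls inside the canary cohort.

    percent is the share of users (0 to 100) who should get the new version.
    The salt (a hypothetical release identifier) reshuffles cohorts per release.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets, giving 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    cohort = [u for u in users if in_canary(u, 5)]
    print(f"{len(cohort)} of {len(users)} users would receive the canary")
```

Because the bucket is fixed per user, raising the percentage only adds users to the cohort and never removes anyone already in it, which keeps the experience consistent during a gradual rollout.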

The following diagram outlines the major categories:
[Figure: overview mindmap]


Leap-frogging the Unicorns

or disrupting the disruptors

Keeping up with the Cambrians

I recently saw a chart that plotted the occurrence of the phrase “exponential growth” in published works over the last decades. Unsurprisingly, the chart showed an exponential curve. Similarly, I have started to notice a Cambrian explosion of “Cambrian explosions” … (the Cambrian Explosion was a phase in our geological record where there was an apparently very rapid increase in the diversity of life forms on Earth). I’m seeing the term applied in a broad variety of technology fields right now: as I cycle to work every day I’m seeing a Cambrian explosion of personal propulsion devices including electric skate-boards, power-assisted bicycles, hover-boards, scooters and obviously electric and potentially self-driving cars; in my day job we’re seeing a Cambrian explosion in tools and techniques to make data-centers ever more powerful and reliable (it’s not just the jobs of commercial drivers that are under threat from the new algorithms; sysadmins are endangered too); you just need to browse through Kickstarter or Indiegogo to see the explosion in ingenious ideas about how to graft ubiquitous connectivity and embedded smarts into everyday objects; and while we’re at it we’re seeing a Cambrian explosion in terms to describe the ecosystem of all these smart connected devices.

Not disruptive

Contrary to popular opinion, the likes of Uber and Airbnb are not disruptive innovators, at least not in the technical sense. These “unicorns” are clearly having a “disruptive” impact, in the colloquial sense, on their respective industries. But if we remove the label of “disruptor” and examine how they have succeeded, we may get a better insight into how to replicate their successes, or even improve on them, particularly if we broaden our remit to focus on reducing friction not just at the individual level, but also at the societal level.


The Green Shoots of Fair Data

“Privacy is dead – get used to it!” This is the common wisdom you’ll hear if you spend much time hanging out near Silicon Valley, reading about the latest application of predictive analytics to improving customer loyalty, or following the most recent start-ups who are busy wiring up every corner of the world to the growing Internet of Things. I spend my time doing all those things, but I don’t accept the common wisdom – I want to explore with you why I believe that reports of the death of privacy are much exaggerated. And I want to explore how there may be viable and differentiating advantages for organizations to pursue a different path.

The Data Economy

It’s clear that we’re living in a burgeoning data economy and that this economy is driven by technology. Moore’s Law rattles on apace and in its wake new generations of devices and sensors are making more and more areas of the physical world addressable by compute. We’re experiencing a self-reinforcing cycle: advances in technology extract ever increasing oceans of data from the world and its inhabitants; this data is used to tailor ever better digital products and services; these improved products in turn generate more profit, which is then funneled back into R&D to drive new technological advances; and so the virtuous techno-utopian cycle keeps turning.

This cycle has a secondary engine whipping it along faster and faster: as we create better products the loyalty and trust of customers grows and their willingness to share ever more data increases. The implicit bargain that modern organizations are making with their customers is: “give me your data and we will give you delightful services.” Even if customers don’t explicitly state their acceptance of this bargain, their tacit acceptance of the deal drives the conventional wisdom that privacy is, to all intents and purposes, dead. For as long as we lap up ostensibly free services such as Gmail, Facebook and Dropbox, that are funded by the data and insights they can extract and sell to advertisers, there will be no impetus to search for an alternative to the conventional wisdom. Similarly we’re seeing frenetic competition to customize recommendations (and potentially pricing) for customers of retail, travel and media products.


Web 2.0 created Surveillance 1.984

Web 2.0 has had a massive impact for good on the lives of modern humans. Web 2.0 has also been complicit in ushering in the most advanced, pervasive and Orwellian surveillance state ever witnessed by humanity. You could say that Web 2.0 created Surveillance 1.984.

How might we retain the benefits of a hyper-connected and computer-augmented society without being constantly watched by people whose interests may not always directly align with ours? How can we use technology to fashion a future that we actually want to inhabit?

The full details of the monitoring apparatus that the NSA, CIA and other “security” agencies have constructed are still trickling out from the cache of documents released into the wild by Edward Snowden. What has become clear is that every action performed in the digital arena, whether it be sending an email, making a phone call, browsing a website, tweeting an opinion, buying an item, taking a photo or just moving around with a phone in your pocket, can be, and usually is, intercepted, stored and mined for information. The technologies and services that allow us to be constantly connected to information, colleagues, friends and loved ones at the same time allow the government to snoop on private citizens in an unprecedented, unrequested and effectively unregulated manner.

Read the rest of my article on Medium


Evolving for multiple screens (Video)

Here’s a video of a talk I recently gave with my colleague Stew Gleadow at our ThoughtWorks Live event in Sydney and Melbourne, Australia, in May.

It looks at strategies for successfully evolving mobile services and applications over time across a range of screens and platforms. We delve into some case studies on an Australian broadcaster’s second-screen application and a cross-platform approach for a major airline.

Beyond Mobile, Part 2: Thriving in the Shattered Future

This article was originally published by InformIT and can be viewed on their site. It is reproduced here with kind permission.

Part 1 of this series examined the explosion of mobile and embedded devices that characterize our future, explored the challenges posed by these changes, and considered a methodology for reliable innovation in this environment and the technology enablers required to support that approach. In part 2, we look at what types of strategies are likely to be effective in this new world.

Visionary Strategies

Once you have a reliable methodology in place for fostering innovation and engaging the market, supported by the technology enablers mentioned in part 1, you are finally ready to start growing and developing visionary strategies to help you capitalize on the emerging world of ambient computing.

The big question becomes, “What should our vision and strategy be?” Unfortunately, there’s no stock answer I can prescribe (though I’ll be happy to help you figure it out), but I do have some pointers toward directions you should be considering.

The growing ubiquity of computing and omnipresent interfaces points to opportunities such as “any customer, anywhere,” and the explosion of profiling data opens up services based on the idea that “we know what you’re about to think.” The key is not what your exact vision is, but how you validate it and course-correct based on that feedback. This in itself is the strategy of rapid product evolution for which part 1 of this article attempted to lay out the foundations.


Beyond Mobile, Part 1: Surviving the Shattered Future

This article was originally published by InformIT and can be viewed on their site. It is reproduced here with kind permission.

The world is changing, and we all need to prepare for it. The proliferation of mobile devices we are witnessing right now, and the associated challenges related to creating applications that work across those devices, are just the thin end of the wedge of what the future holds. Cisco predicts that by 2020 each of us will own an average of 6.58 connected devices. People are interacting with organizations and services with an ever more diverse set of technologies, they are doing this in a growing number of contexts, and the data being created is growing exponentially. In this two-part series, we’ll look at strategies for not just surviving (part 1), but thriving in and capitalizing on the opportunities provided by our hyper-connected future (part 2).

A Shattered Future

If we look closely at the technology trends, of which mobile is just one part, it becomes clear that we are witnessing a shattering of input and output mechanisms. In the past, interactions with computers have been through fairly narrow channels. The vast majority of inputs have historically been via keyboard, and outputs were predominantly through a single fixed screen. That simple past and the strategies we developed to operate in that world are no longer useful guides to the future. We are witnessing an explosion of channels for interacting with computers. Those channels are no longer tightly coupled to each other, and even the concept of “a computer” is being blown away.


Bring your own device … as long as it’s HTML5

As we talk with clients and prospects in the market we’re seeing a steady growth in interest around BYOD (Bring Your Own Device). This trend to allow employees to bring their own hardware (predominantly mobile phones) is putting new stresses and strains on existing IT infrastructure, operations and development practices. There are many pitfalls to watch out for, but if executed successfully, embracing the consumerization of enterprise IT can pay dividends by re-engaging a jaded work-force, simplifying cumbersome workflows and offering a launchpad to a next generation of more supple, usable and maintainable software.


The wave is inevitable

The days when organizations could mandate a limited set of issued (or supported) devices and provide access to services that were designed more around the constraints of existing IT than the users’ needs are ending abruptly. Organizations that hesitate to overhaul their approaches are discovering that employees quickly find ways to circumvent existing procedures and systems. It used to be the case that the software and hardware that enterprises offered their employees tended to be superior to what they encountered at home. With the advent of services like Gmail, Dropbox and Skype and of hardware like the iPhone and iPad, those days are well and truly behind us. Organizations that don’t respond swiftly to embrace this trend are finding themselves saddled with a disgruntled and unproductive workforce and a growing security attack surface as their employees find work-arounds to shoe-horn their favorite tools into their work lives.

BYOD introduces many challenges – security of services and data is high on the list as is distribution and provisioning along with exposing key systems, like email and calendar, to a range of native applications. However this article is focused on the challenges involved in building or migrating applications to work on a variety of devices and a range of contexts.


Dealing with creaky legacy platforms

The following article, written by my colleague Matt Simons and me, was published in the December 2010 issue of the Cutter IT Journal and is reproduced here with kind permission. It was also the subject of a talk we delivered in Santa Clara.

The landscape is changing

Since the dawn of the software era, systems have generally followed a lifecycle of develop/operate/replace. For the type of systems our company, ThoughtWorks, specializes in (typically built over the past 10-15 years), organizations expect as much as 5-10 years between significant investments in modernization. And some of the oldest core systems have now reached 40+ years – far longer than the average life-span of most companies today!

IT assets are relatively long-lived largely because modernization often represents a significant investment that doesn’t deliver new business value in a form that is very visible to managers or customers. Therefore organizations put off that investment until the case for change becomes overwhelming. Instead, they extend and modify their increasingly creaky platforms by adding features and making updates to (more or less) meet business needs.

For decades, this tension between investing in modernization and making incremental enhancements has played out across technology-enabled businesses. Every year some companies take the plunge and modernize a core system or two, while others opt to put yet another layer of lipstick on the pig.