Reliability under abnormal conditions

Preparing Systems for the ‘100-year wave’

Keeping complex distributed systems available to service customer requests under peak load is hard. The challenge is exacerbated by a number of factors: a growing number of services, servers and external integrations combined with a rapid pace of new feature delivery; heavy spikes in load during annual peak periods; and traffic anomalies driven by promotions and external events. Luckily, there are strategies that limit the impact of problems and so support your ability to serve your customers and keep generating revenue, even if it is not feasible to reduce the risk to zero.

Here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers … the majority of your questions trend towards the unknown-unknown. Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll-back (via automated canaries, gradual rollouts, feature flags, etc). — Charity Majors

Breaking down the problem

The two major dimensions to address are: preventing as many issues from arising as possible; and then limiting the impact of issues that do arise. Prevention is often described as increasing mean time between failures (MTBF), and mitigation as decreasing mean time to recovery (MTTR), though time may not be as important a measure as impact on revenue or customer experience; more on that later.

For both prevention and mitigation, there are cost/benefit trade-offs. Cost is measured not just in dollars, but also in the delays to push out new features — an opportunity cost. Ultimately, every organization needs to make its own judgement about the service level it’s willing to commit to, given the cost implications of achieving that service level. Even so, most organizations will strive to continuously lower the cost of supporting their desired service level. This article explores the various strategies and techniques for doing that.

Prevention

Most prevention techniques involve testing the system, or parts of it, before releasing to production. The major categories to cover are: testing for functional correctness, for the ability to perform under expected load, and for resilience to foreseeable failures.

Mitigation

Mitigation involves limiting the breadth of impact, mainly through architectural patterns of isolation and graceful degradation of service, and limiting the duration of impact by improving time to notice, time to diagnose and time to push a fix.

Hybrid

There are also some hybrid strategies that straddle prevention and mitigation. Canary releasing to a subset of users is a type of prevention strategy, but performed in production with the impact heavily mitigated. Likewise, the advanced technique of Chaos Engineering is an approach for testing and practicing prevention and mitigation approaches in a production environment.

The following diagram outlines the major categories:
Figure 1: Overview mind map of the major categories

Cost/benefit analysis

Most of the prevention strategies have a similarly shaped cost/benefit curve: early investments provide good results, but you soon hit a point of diminishing returns, where the value of catching ever more obscure issues in complex corner cases no longer justifies the effort of doing so.
Figure 2: Levels of confidence gained from pre-production testing (confidence versus cost)

Example 1: a system that connects with many partners to get real-time pricing information started receiving malformed XML responses from one partner. That caused an infinite recursion in an open source XML parser, which maxed out CPUs on multiple servers. The effort required to test enough permutations of malformed XML to catch this framework issue ahead of time was extreme.

Example 2: the server-side content cache of one application suffered a thundering herd problem when search engines started crawling the content in production. The issue was not caught in performance testing because it only occurred when cache eviction coincided with search engine requests for the same content.

As systems grow more complex — in terms of number of components, infrastructure, users, integrations and features — the cost of ensuring a fixed level of correctness and resilience increases to the point that the trade-off becomes less worthwhile.

Figure 3: Maintaining a level of confidence as system complexity grows (complexity versus cost)

Example 3: companies like Facebook and Uber have such scale and complexity in their production environments that attempting to replicate production for load testing is unrealistic. They tend to lean more heavily on strategies for testing performance in production, using techniques such as canary releases to minimize the impact of problems.

Overall recommendations

Our overall recommendations are:

  • Apply rigor to pre-production testing, but be aware of the curve of diminishing returns
  • Use layered strategies to improve the cost/benefit equation
  • Investigate options for load testing safely in production
  • Focus on mitigating the impact of problems in production, both by containing how far problems can spread and by improving the speed of noticing, diagnosing and responding to issues

Pre-production testing

While not all potential problems can be caught within reasonable budgets and timescales through pre-production testing, we still advise investing effort into mitigating basic risks through rigorous automated testing for correctness, performance and resilience, before releasing to production.

Functional testing

This article isn’t primarily concerned with the functional correctness of a system; nevertheless, good unit and functional tests can head off many resilience problems by testing how components handle various error scenarios and whether they degrade gracefully.
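For example, a minimal unit-test sketch (the pricing service and client names here are hypothetical) might stub a failing upstream dependency and assert that the component falls back to a sensible default rather than propagating the error:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.math.BigDecimal;
import org.junit.jupiter.api.Test;

class PricingServiceDegradationTest {

    // Hypothetical upstream client interface the service depends on.
    interface PartnerPriceClient {
        BigDecimal fetchPrice(String sku);
    }

    // Hypothetical component under test: falls back to a list price
    // when the partner call fails, instead of propagating the error.
    static class PricingService {
        private final PartnerPriceClient client;
        private final BigDecimal listPrice;

        PricingService(PartnerPriceClient client, BigDecimal listPrice) {
            this.client = client;
            this.listPrice = listPrice;
        }

        BigDecimal priceFor(String sku) {
            try {
                return client.fetchPrice(sku);
            } catch (RuntimeException e) {
                return listPrice; // graceful degradation: serve the fallback price
            }
        }
    }

    @Test
    void fallsBackToListPriceWhenPartnerCallFails() {
        PartnerPriceClient failingClient = sku -> {
            throw new RuntimeException("connection reset");
        };
        PricingService service = new PricingService(failingClient, new BigDecimal("9.99"));

        assertEquals(new BigDecimal("9.99"), service.priceFor("SKU-42"));
    }
}
```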

Load testing

In pre-production testing, you want to ensure that code changes haven’t had a negative impact on (localized) performance, and also to get a baseline for capacity planning by understanding the load an individual node can support and the scaling efficiency of adding more resources. The first two can be covered by running basic load tests in your CI/CD pipeline. Understanding the linearity (or lack thereof) in the scaling characteristics of your components requires a more specific set of tests, ones that observe the impact on throughput as you add nodes, CPUs or other resources.

For pre-production load testing, you can use a tool like Gatling or Tsung to generate reasonably realistic loads against a full or partial set of deployed services. We recommend running these types of tests as part of your build pipeline, but since they can take a while to run, they are often run out of band in a fan-out/fan-in manner.
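As an illustrative sketch, a small Gatling simulation using its Java DSL might look like the following; the base URL, endpoint and thresholds are placeholder assumptions you would replace with your own:

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class BasicLoadSimulation extends Simulation {

    // Placeholder base URL for a deployed test environment.
    HttpProtocolBuilder httpProtocol = http.baseUrl("https://test.example.com");

    // A simple scenario hitting a (hypothetical) pricing endpoint.
    ScenarioBuilder scn = scenario("Pricing lookup")
            .exec(http("get price")
                    .get("/api/prices/42")
                    .check(status().is(200)))
            .pause(1);

    public BasicLoadSimulation() {
        setUp(scn.injectOpen(rampUsers(200).during(120)))
                .protocols(httpProtocol)
                // Fail the build if latency or error rate regresses past these example thresholds.
                .assertions(
                        global().responseTime().max().lt(1000),
                        global().successfulRequests().percent().gt(99.0));
    }
}
```

Failing the build on assertions like these turns the load test into a performance regression gate.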

Figure 4: Recommended tests for your CD pipeline

It is useful to test out your monitoring/observability capabilities during these load tests to check that you can identify the cause of bottlenecks. For example, network call monitoring (or code profiling) can catch n+1 problems in chatty service or database calls.

Capacity planning

With the advent of auto-scaling, it is often assumed that systems can scale linearly by adding extra nodes, as long as you follow some basic 12-factor approaches. Sadly, this is rarely true, so it is important to understand how the performance of your system, or elements of your system, responds to the addition of more resources. The response is rarely a straight (linear) line: throughput will normally top out at some point due to contention for shared resources, and it may even start degrading as cross-talk (chattiness) increases quadratically with the addition of more nodes. Understanding and applying the Universal Scalability Law, potentially using tools like USL4J, can help with capacity planning and with fixing the issues that lead to sub-linear scaling.
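To make the Universal Scalability Law concrete, it models throughput on N nodes as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ represents contention and κ represents coherency (cross-talk) costs. The sketch below uses made-up coefficients; in practice you would fit λ, σ and κ to your own load-test measurements, for example with USL4J:

```java
public class UslProjection {

    // Universal Scalability Law: X(N) = lambda * N / (1 + sigma*(N-1) + kappa*N*(N-1))
    // lambda: throughput of a single node; sigma: contention; kappa: coherency (cross-talk).
    static double throughput(double lambda, double sigma, double kappa, int n) {
        return (lambda * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1));
    }

    public static void main(String[] args) {
        // Made-up coefficients; in practice you fit these to measured (nodes, throughput) pairs.
        double lambda = 1000.0; // requests/sec on one node
        double sigma = 0.05;    // contention penalty
        double kappa = 0.002;   // small but quadratically growing coherency penalty

        for (int n : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            System.out.printf("%2d nodes -> %.0f req/s%n", n, throughput(lambda, sigma, kappa, n));
        }
        // Output climbs sub-linearly, peaks, then declines as the coherency term dominates.
    }
}
```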

Resilience testing

Testing how resilient a system will be to unknown issues is a tough problem. We mentioned how elements of resilience can be tested as part of your unit and functional tests, if you include suitable tests for predictable “unhappy paths”: your unit and functional tests should verify that your application responds to network failures or errors in upstream systems in a graceful way that doesn’t propagate and amplify failures through the system. Later, we’ll also look at testing resilience in production with chaos engineering approaches. Beyond that, there are a few other failure modes that can be tested relatively easily before getting to production.

There are some other categories of problems like slow networks, low bandwidth, dropped packets and timeouts that can be simulated pre-production, but aren’t always caught by unit testing. Network conditioning tools can help simulate these issues. This is particularly useful for testing mobile applications, but can also be applied to inter-service communications.

Layered strategies

Multiple approaches can be layered to improve your chances of success. Key areas to investigate are:

  • Use contract testing to reduce the amount of end-to-end integration tests needed
  • Use service virtualization (e.g. mountebank) to reduce the number of services that need to be deployed for a performance test and to allow you to simulate downstream latency using record and playback

Figure 5: Layering multiple approaches delivers more benefit

Testing in production

Given that setting up and maintaining fully production-like environments can get ever more costly and problematic, there are various approaches for using production to gain confidence in how your evolving system will respond under load.

Canary releases

A standard approach for using real production traffic and infrastructure is canary releasing, pioneered by the likes of Google and Facebook. By exposing changes to a very low percentage of traffic, then incrementally increasing that percentage as confidence grows, many categories of performance and usability problems can be caught quickly.
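As a simplified sketch of the underlying mechanics (not any particular vendor’s implementation), canary routing often comes down to deterministically bucketing users so that a small, configurable percentage hits the new version while everyone else stays on the stable one:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CanaryRouter {

    private final int canaryPercent; // e.g. 1, 5, 25, 50, 100 as confidence grows

    public CanaryRouter(int canaryPercent) {
        this.canaryPercent = canaryPercent;
    }

    /** Deterministically buckets a user: the same user always gets the same version. */
    public String versionFor(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        long bucket = crc.getValue() % 100;
        return bucket < canaryPercent ? "canary" : "stable";
    }

    public static void main(String[] args) {
        CanaryRouter router = new CanaryRouter(5); // start with 5% of traffic
        System.out.println(router.versionFor("user-1234"));
    }
}
```

Deterministic bucketing matters: a given user should keep seeing the same version while the canary percentage is gradually increased.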

Canary releasing has a few prerequisites and some limitations:

  • It requires reasonable levels of traffic to be able to segment in a manner that provides useful feedback. Ideally, you can start with safe internal “dogfood” users
  • It demands solid levels of operational maturity, so that issues can be spotted and rolled back quickly
  • It only gives you feedback on standard traffic patterns (during the canary release period) and gives you less confidence about how the updated system will operate under abnormal traffic shape or volume driven by promotions or seasonal shifts.

Load testing in production

If your production environment is large and expensive to provision, or your data is hard to recreate, realistic pre-production testing is difficult. Your first step should be to reduce the cost of recreating production but, failing that, there are various approaches for load testing in production.

One approach is to run ‘synthetic transactions’ in large volume to ‘hammer test’ components of your system and see how they respond. Another approach is to store up asynchronous load and release it in a large batch to simulate heavier traffic scenarios. Both approaches require reasonable confidence in your operational ability to manage these rushes in load, but they can be useful tactics for understanding how your system will respond under serious load.
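As a rough sketch of the synthetic-transaction idea (the endpoint and tagging header below are hypothetical), the essence is firing a controlled burst of clearly labelled requests at a component, so the traffic can be recognized and filtered out of business metrics while you observe how the component copes:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class SyntheticHammer {

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint; the header lets downstream systems and dashboards
        // recognize (and, if necessary, discard) synthetic traffic.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://prod.example.com/api/orders/price"))
                .header("X-Synthetic-Test", "true")
                .GET()
                .build();

        // Fire a burst of concurrent synthetic transactions and count server errors.
        long failures = IntStream.range(0, 500)
                .mapToObj(i -> client.sendAsync(request, HttpResponse.BodyHandlers.discarding()))
                .map(CompletableFuture::join)
                .filter(response -> response.statusCode() >= 500)
                .count();

        System.out.println("5xx responses: " + failures);
    }
}
```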

Example 4: We would hold production traffic in our queues for a fixed time, typically 6–8 hours (enough to create the necessary queue depth for a load test), release it and then monitor application behavior. We built a business impact dashboard specifically to monitor business-facing metrics, and we took the utmost care not to affect business SLAs when doing so. For example, for our order routing system we would not hold expedited shipping orders, as that would affect the shipping SLAs. We also had to be aware of the downstream implications of the test: essentially, we were load testing the downstream systems as well. In our case, that meant the warehouses, which would receive a large number of shipment requests at the same time. This put pressure not only on their software systems, but also on their human systems, such as the workforce in the warehouse.

Chaos engineering

In many ways, the gold standard for resilience testing is chaos engineering. This uses deliberate fault injection to verify that your system is resilient to the types of failures that occur in large distributed systems. The underlying concept is to randomly generate the failures you believe your system needs to withstand, and then verify that your important SLOs aren’t impacted by these generated failures.
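The sketch below illustrates the fault-injection idea at its simplest; real chaos tooling typically operates at the infrastructure or platform level, and the probability, latency and wrapped call here are purely illustrative:

```java
import java.time.Duration;
import java.util.Random;
import java.util.function.Supplier;

public class FaultInjectingWrapper<T> {

    private final Supplier<T> delegate;       // the real call being wrapped
    private final double failureProbability;  // e.g. 0.01 during an experiment, 0.0 otherwise
    private final Duration injectedLatency;
    private final Random random = new Random();

    public FaultInjectingWrapper(Supplier<T> delegate, double failureProbability, Duration injectedLatency) {
        this.delegate = delegate;
        this.failureProbability = failureProbability;
        this.injectedLatency = injectedLatency;
    }

    public T call() {
        if (random.nextDouble() < failureProbability) {
            // Randomly inject either latency or an outright failure,
            // then watch your SLO dashboards for any impact.
            if (random.nextBoolean()) {
                sleep(injectedLatency);
            } else {
                throw new RuntimeException("chaos experiment: injected failure");
            }
        }
        return delegate.get();
    }

    private static void sleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

An experiment would enable a small failure probability for one dependency, verify that the relevant SLOs hold, and then dial it back.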

Obviously embarking on this approach without high levels of operational maturity is fraught with danger. But there are techniques for progressively introducing types of failures and performing non-destructive dry-runs to prepare teams and systems.

We strongly believe the right way to develop the required operational maturity is to exercise this type of approach from the early stages of system development and evolution so that the observability, monitoring, isolation, and processes for recovery are firmly set in place before a system becomes too large or complex for easy retro-fitting.

Mitigating impact in production

Certain categories of problems are still going to make it to production, so it’s important to focus on two dimensions of lowering the impact of issues:

  • Minimize the breadth of impact of issues, mainly through architectural approaches
  • Minimize time to recover through monitoring, observability, ops/dev partnership, and CD practices

Figure 6: Minimizing the impact of problems

Minimizing the blast radius (breadth of impact)

Minimizing the breadth of impact of a problem is largely an architectural concern. There are many patterns and techniques for adding resilience to systems, but most of the advice boils down to: having sensible slicing of components inside your system; having a Plan A and a Plan B for how those components interact; and having a way to decide when you need to invoke Plan B (and Plans C, D, E, etc).

The book “Release It!” has a great exploration of the major patterns (timeouts, bulkheads, circuit-breakers, fail-fast, handshakes, decoupling, etc) and how to think about capacity planning and monitoring. Meanwhile, tools like Hystrix can help you manage graceful degradation in distributed systems — but such tools don’t remove the need for good design and ongoing rigorous testing.
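For example, a minimal Hystrix command wrapping a remote pricing call might look like the sketch below (the remote call and the fallback value are hypothetical); if the call fails, times out or the circuit is open, the fallback is returned instead of the error propagating upstream:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import java.math.BigDecimal;

public class PartnerPriceCommand extends HystrixCommand<BigDecimal> {

    private final String sku;

    public PartnerPriceCommand(String sku) {
        // The group key is used for bulkheading: commands in the same group share a thread pool.
        super(HystrixCommandGroupKey.Factory.asKey("PartnerPricing"));
        this.sku = sku;
    }

    @Override
    protected BigDecimal run() throws Exception {
        // Hypothetical remote call to the partner pricing service.
        return PartnerPricingClient.fetchPrice(sku);
    }

    @Override
    protected BigDecimal getFallback() {
        // Plan B: degrade gracefully to a cached or list price rather than failing the request.
        return BigDecimal.valueOf(9.99);
    }

    // Hypothetical client stub so the sketch is self-contained.
    static class PartnerPricingClient {
        static BigDecimal fetchPrice(String sku) {
            throw new UnsupportedOperationException("call the real partner API here");
        }
    }
}
```

A caller would invoke it with new PartnerPriceCommand("SKU-42").execute(), leaving Hystrix to decide between Plan A and Plan B.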

Shortening time to recovery

If we accept that problems are inevitable in a complex system, then we need to focus on how to rectify them quickly when they occur. We can break this down into three areas: time to notice, time to diagnose, and time to deploy a fix. ThoughtWorks has been advising teams for a while to focus on mean time to recovery over mean time between failures.

Continuous Delivery

Time to push a fix once a problem has been diagnosed is largely about engineering lead time. ThoughtWorks has long advocated the benefits of continuous delivery, so there’s no need to revisit the principles and practices here. But it’s worth considering how you can use CD to optimize the process of going from identifying a defect to pushing a change.

Monitoring and alerting

Monitoring is a detailed topic, but the basic aim is to gather the information that enables you to notice (and be alerted) when part of the system is no longer functioning as expected. Where possible, the focus should be on leading indicators that alert you before an issue causes noticeable effects. It’s useful to monitor at three different levels:

  • Infrastructure: disk, CPU, network, IO
  • Application: request rate, latency, errors, saturation
  • Business: funnel conversions, progress through standard flows, etc.

A useful approach for measuring the negative impact of new features or code changes is monitoring key business metrics and watching for deviations from standard behavior.
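As a hedged illustration using a metrics library such as Micrometer (the metric names and the checkout flow below are assumptions), application-level and business-level signals can be emitted side by side so that deviations after a release stand out:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CheckoutMetrics {

    private final Timer requestLatency;
    private final Counter requestErrors;
    private final Counter ordersPlaced; // business-level metric alongside application-level ones

    public CheckoutMetrics(MeterRegistry registry) {
        this.requestLatency = Timer.builder("checkout.request.latency").register(registry);
        this.requestErrors = Counter.builder("checkout.request.errors").register(registry);
        this.ordersPlaced = Counter.builder("checkout.orders.placed").register(registry);
    }

    public void placeOrder(Runnable order) {
        try {
            requestLatency.record(order); // application signal: how long the order flow takes
            ordersPlaced.increment();     // business signal: successful orders
        } catch (RuntimeException e) {
            requestErrors.increment();    // application signal: error rate
            throw e;
        }
    }

    public static void main(String[] args) {
        CheckoutMetrics metrics = new CheckoutMetrics(new SimpleMeterRegistry());
        metrics.placeOrder(() -> { /* hypothetical order-placing logic */ });
    }
}
```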

Observability

‘Observability’ is a concept that’s gaining traction. It focuses on designing applications that emit all the information required to rapidly discover the root cause of problems: not just expected problems (known unknowns), but also the unexpected problems (unknown unknowns) typical of the types of issue encountered in complex distributed systems.

Tooling and practices in this space are still evolving, but good approaches to structured logging, correlation IDs, and semantic monitoring are a good starting point, as is the ability to query your time-series logging and monitoring data in a correlated fashion through tools such as Prometheus or Honeycomb.
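For instance, a common starting point for correlated logs is propagating a correlation ID through the logging context; the sketch below uses SLF4J’s MDC, and the header handling and ID generation are assumptions rather than a prescribed approach:

```java
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CorrelationIdExample {

    private static final Logger log = LoggerFactory.getLogger(CorrelationIdExample.class);

    public static void handleRequest(String incomingCorrelationId) {
        // Reuse the caller's correlation ID if one was passed in, otherwise mint a new one.
        String correlationId = incomingCorrelationId != null
                ? incomingCorrelationId
                : UUID.randomUUID().toString();

        MDC.put("correlationId", correlationId);
        try {
            // Every log line emitted in this scope can be joined across services
            // when the logging pattern includes %X{correlationId}.
            log.info("pricing lookup started");
            // ... call downstream services, forwarding the correlation ID as a header ...
            log.info("pricing lookup finished");
        } finally {
            MDC.remove("correlationId");
        }
    }

    public static void main(String[] args) {
        handleRequest(null);
    }
}
```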

Starting early to avoid the S-curve

Earlier, we talked about the cost/benefit curves of the various tactics for gaining confidence in how your system will operate under abnormal load or unexpected conditions. We drew a nice convex curve to illustrate the trade-offs. But this nice curve is based on the assumption that the cost of getting started with a technique is low. The longer you wait to introduce a practice, the greater the ramp-up cost will be, and the convex curve turns into more of an S-curve.

Figure 7: Starting practices early versus late (early adoption versus retrofitting)

For this reason, we recommend starting as many practices (load testing, observability, chaos engineering, etc) as early as possible. We think retrofitting a practice or test suite is likely to incur high costs.

Conclusion

Accepting and preparing for unanticipated combinations of problems is a fact of life in modern complex systems. However, we believe the risks posed by these unknown unknowns can be heavily reduced by applying a combination of approaches:

  • Stopping predictable problems through pre-production testing, layering together multiple approaches to gain the maximum benefit from the effort applied
  • Limiting the impact of problems that occur under standard usage by carefully phasing in releases through incremental roll-outs, and by providing architectural bulwarks to prevent the proliferation of problems
  • Preparing to respond to unanticipated problems by developing mature operational abilities to spot, diagnose and fix issues as they arise
  • Applying hybrid techniques, such as chaos engineering, to both identify potential issues before they arise and at the same time test out your operational maturity
  • Lowering the cost of these approaches by thinking about and applying them from early in the development lifecycle

This article was originally posted on the ThoughtWorks insights channel. Many thanks to my colleagues who provided insights and feedback on earlier drafts: Srinivasan Raguraman, Zhamak Dehghani, Linda Goldstein, Joshua Jordan, Praful Todkar, Brandon Byars, Unmesh Joshi, Ken Mugrage, Bill Codding, Matthew Moore, and Gareth Morgan.