Part I Introduction

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
—Principles of Chaos

Using Chaos Engineering may be as simple as manually running kill -9 on a box inside of your staging environment to simulate the failure of a service. Or, it can be as sophisticated as automatically designing and carrying out experiments in a production environment against a small but statistically significant fraction of live traffic.
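
As a minimal sketch of the simple end of that spectrum, the snippet below sends SIGKILL to one process of a service running in a staging environment. The service name and the use of pgrep are illustrative assumptions, not part of any particular chaos tool.

```python
import os
import random
import signal
import subprocess

def kill_random_process(service_name: str) -> None:
    """Send SIGKILL to one process of the given service to simulate a crash."""
    # Find PIDs matching the service name (assumes a pgrep-compatible host).
    result = subprocess.run(["pgrep", "-f", service_name],
                            capture_output=True, text=True)
    pids = [int(pid) for pid in result.stdout.split()]
    if not pids:
        print(f"no running processes found for {service_name}")
        return
    victim = random.choice(pids)
    print(f"killing pid {victim} of {service_name}")
    os.kill(victim, signal.SIGKILL)  # the equivalent of `kill -9 <pid>`

if __name__ == "__main__":
    kill_random_process("recommendation-service")  # hypothetical service name
```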

The history of Chaos Engineering at Netflix started in 2008:

  • Chaos Monkey: got the ball rolling, gaining notoriety for turning off services in the production environment
  • Chaos Kong: transferred those benefits from the small scale to the very large
  • Failure Injection Testing (FIT): the foundation for tackling the space in between

1. Why Do Chaos Engineering?

How Does Chaos Engineering Differ from Testing?

The primary difference between Chaos Engineering and existing approaches such as fault injection is that Chaos Engineering is a practice for generating new information, while fault injection is a specific approach to testing one condition.

Tests are typically binary, and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it. Experimentation generates new knowledge, and often suggests new avenues of exploration.

Examples of inputs for chaos experiments

  • Simulating the failure of an entire region or datacenter.
  • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
  • Injecting latency between services for a select percentage of traffic over a predetermined period of time.
  • Function-based chaos (runtime injection): randomly causing functions to throw exceptions (a minimal sketch of this follows the list).
  • Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
  • Time travel: forcing system clocks out of sync with each other.
  • Executing a routine in driver code emulating I/O errors.
  • Maxing out CPU cores on an Elasticsearch cluster.
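
As a minimal sketch of two of these inputs, latency injection and function-based chaos (runtime injection), the decorator below delays or fails a small, configurable fraction of calls. The fault rates and the decorated function are illustrative assumptions.

```python
import random
import time
from functools import wraps

def inject_chaos(latency_rate=0.10, latency_seconds=1.0, exception_rate=0.02):
    """Decorator that delays or fails a configurable fraction of calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < exception_rate:
                # Function-based chaos: simulate an unexpected failure.
                raise RuntimeError("injected fault")
            if roll < exception_rate + latency_rate:
                # Latency injection: simulate a slow dependency for this call.
                time.sleep(latency_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_chaos(latency_rate=0.10, latency_seconds=1.0, exception_rate=0.02)
def fetch_recommendations(user_id: int) -> list:
    return [f"title-{user_id}-{i}" for i in range(3)]

if __name__ == "__main__":
    for uid in range(20):
        try:
            fetch_recommendations(uid)
        except RuntimeError as err:
            print(f"user {uid}: {err}")
```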

Prerequisites for Chaos Engineering

Answer one question: Is your system resilient to real-world events such as service failures and network latency spikes?

If you are certain that a Chaos Engineering experiment will lead to a significant problem with the system, there’s no sense in running that experiment. Fix that weakness first.

Another essential element: A monitoring system that you can use to determine the current state of your system.
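
A minimal sketch of using that monitoring system as a prerequisite check, where `get_metric` is a hypothetical stand-in for whatever monitoring API is actually available: only start an experiment if the steady-state metric currently sits inside its expected range.

```python
def get_metric(name: str) -> float:
    """Placeholder for a real monitoring query (e.g., a time-series API)."""
    return 9800.0  # pretend this is the current requests-per-second reading

def safe_to_experiment(metric_name: str, lower: float, upper: float) -> bool:
    """Only proceed if the steady-state metric is currently within bounds."""
    value = get_metric(metric_name)
    healthy = lower <= value <= upper
    print(f"{metric_name}={value:.0f}, expected [{lower}, {upper}] -> "
          f"{'proceed' if healthy else 'fix the weakness first'}")
    return healthy

if __name__ == "__main__":
    if safe_to_experiment("requests_per_second", lower=9000, upper=11000):
        print("starting chaos experiment")
```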

2. Managing Complexity

Software engineers typically optimize for three properties: performance, availability, and fault tolerance. Engineers at Netflix also consider a fourth property: velocity of feature development.

Netflix chose to adopt a microservice architecture. Let us remember Conway’s Law:

Any organization that designs a system (defined broadly) will inevitably produce a design whose structure is a copy of the organization’s communication structure.
—Melvin Conway, 1967

An Example of the "Bullwhip Effect"

Bullwhip effect: A small perturbation in input starts a self-reinforcing cycle that causes a dramatic swing in output.

The most important feature in the example is that all of the individual behaviors of the microservices are completely rational. Only taken in combination under very specific circumstances do we end up with the undesirable systemic behavior. Chaos Engineering provides the opportunity to surface these effects and gives us confidence in a complex distributed system.
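
To make the amplification concrete, here is a minimal sketch, assuming a chain of services in which each tier adds retries on top of the traffic it forwards downstream; the tier count and retry fraction are illustrative, not taken from the book's example.

```python
def amplified_load(base_load: float, tiers: int, retry_fraction: float) -> float:
    """Each tier retries a fraction of calls, adding to downstream traffic."""
    load = base_load
    for tier in range(tiers):
        load *= 1 + retry_fraction  # retries compound tier by tier
        print(f"tier {tier + 1}: {load:,.0f} requests/sec")
    return load

if __name__ == "__main__":
    # A 10% perturbation at the edge becomes roughly 61% more traffic
    # by the time it has compounded through five tiers.
    amplified_load(base_load=10_000, tiers=5, retry_fraction=0.10)
```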

Chaos Kong

While Chaos Monkey turns off instances, Chaos Kong turns off an entire Amazon Web Services (AWS) region, ensuring that the service is resilient to an outage in any one region, whether that outage is caused by an infrastructure failure or self-inflicted by an unpredictable software interaction.

Part II The Principles of Chaos

Short version: Principles of Chaos Engineering

The performance of complex systems is typically optimized at the edge of chaos, just before system behavior will become unrecognizably turbulent.
—Sidney Dekker, Drift Into Failure

Hypothesize about Steady State

Our goal in identifying steady state is to develop a model that characterizes the steady state of the system based on expected values of the business metrics. Keep in mind that a steady state is only as good as its customer reception.

For any metric you choose, you’ll need to balance:

  • the relationship between the metric and the underlying construct
  • the engineering effort required to collect the data
  • the latency between the metric and the ongoing behavior of the system

Characterizing Steady State

Maybe a single threshold, maybe a pattern describing periodic changes.
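
A minimal sketch of both characterizations, using a hypothetical hourly pattern for a streams-per-second style metric; the numbers and the 5% tolerance are assumptions.

```python
def within_threshold(value: float, minimum: float) -> bool:
    """Simplest steady-state model: a single lower bound."""
    return value >= minimum

def within_pattern(value: float, hour: int, pattern: dict, tolerance: float) -> bool:
    """Periodic model: the expected value varies by hour of day."""
    expected = pattern[hour]
    return abs(value - expected) / expected <= tolerance

if __name__ == "__main__":
    # Hypothetical hourly pattern for a "streams started per second" metric,
    # higher in the evening than during the day.
    hourly_pattern = {hour: 5000 + 3000 * (hour >= 18) for hour in range(24)}
    print(within_threshold(7600, minimum=4000))                      # True
    print(within_pattern(7600, hour=20, pattern=hourly_pattern,
                         tolerance=0.05))                            # True
```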

Forming Hypotheses

Once you have your metrics and an understanding of their steady state behavior, you can use them to define the hypotheses for your experiment. Think about how the steady state behavior will change when you inject different types of events into your system. If you add requests to a mid-tier service, will the steady state be disrupted or stay the same? If disrupted, do you expect the system output to increase or decrease? Finally, think about how you will measure the change in steady state behavior.

For example, the hypotheses in Netflix's experiments are usually in the form of “the events we are injecting into the system will not cause the system’s behavior to change from steady state.”
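
A minimal sketch of checking that hypothesis, assuming the same steady-state metric can be sampled from a control group and an experiment group; the sample values and the 5% tolerance are illustrative.

```python
def hypothesis_holds(control: list, experiment: list, tolerance: float = 0.05) -> bool:
    """True if the experiment group's mean stays within `tolerance` of control."""
    control_mean = sum(control) / len(control)
    experiment_mean = sum(experiment) / len(experiment)
    deviation = abs(experiment_mean - control_mean) / control_mean
    print(f"control={control_mean:.1f} experiment={experiment_mean:.1f} "
          f"deviation={deviation:.1%}")
    return deviation <= tolerance

if __name__ == "__main__":
    control_sps = [101.0, 99.5, 100.2, 100.8]    # steady-state metric, control
    experiment_sps = [98.9, 99.1, 100.0, 97.5]   # same metric with injected faults
    print("hypothesis holds:", hypothesis_holds(control_sps, experiment_sps))
```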

Canary Analysis

At Netflix, we do canary deployments: we first deploy new code to a small cluster that receives a fraction of production traffic, and then verify that the new deployment is healthy before we do a full roll-out.
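
A minimal sketch of such a canary gate, assuming a couple of health signals (error rate, p99 latency) are available for both the baseline and the canary cluster; the metric names and the 1.2x limit are assumptions.

```python
def canary_passes(baseline: dict, canary: dict, max_ratio: float = 1.2) -> bool:
    """Fail the canary if any metric exceeds `max_ratio` times the baseline."""
    for name, base_value in baseline.items():
        if canary[name] > base_value * max_ratio:
            print(f"canary failed on {name}: {canary[name]} vs {base_value}")
            return False
    return True

if __name__ == "__main__":
    baseline = {"error_rate": 0.002, "p99_latency_ms": 180}
    canary = {"error_rate": 0.0021, "p99_latency_ms": 210}
    print("promote" if canary_passes(baseline, canary) else "roll back")
```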

Vary Real-World Events

Unpredictable events and conditions we should consider:

  • Hardware failures
  • Functional bugs
  • State transmission errors (e.g., inconsistency of states between sender and receiver nodes)
  • Network latency and partition
  • Large fluctuations in input (up or down) and retry storms
  • Resource exhaustion
  • Unusual or unpredictable combinations of interservice communication
  • Byzantine failures (e.g., a node believing it has the most current data when it actually does not)
  • Race conditions
  • Downstream dependencies malfunction

The scope of impact and isolation for a fault is called the failure domain. We don’t need to enumerate all of the possible events that can change the system, we just need to inject the frequent and impactful ones as well as understand the resulting failure domains. Once again, only induce events that you expect to be able to handle! Induce real-world events, not just failures and latency.
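
One way to make "frequent and impactful" concrete is to keep a small catalog of candidate events and rank them, as in the sketch below; the catalog entries and scores are illustrative assumptions, not a documented Netflix practice.

```python
# Hypothetical catalog of candidate events with rough frequency/impact scores.
events = [
    {"name": "instance termination", "frequency": 0.9, "impact": 0.3},
    {"name": "dependency latency spike", "frequency": 0.7, "impact": 0.6},
    {"name": "region outage", "frequency": 0.1, "impact": 1.0},
    {"name": "clock skew", "frequency": 0.2, "impact": 0.4},
]

# Rank by a simple frequency-times-impact score, highest first.
for event in sorted(events, key=lambda e: e["frequency"] * e["impact"], reverse=True):
    print(f'{event["frequency"] * event["impact"]:.2f}  {event["name"]}')
```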

Run Experiments in Production

When we run Chaos Engineering experiments, we are interested in the behavior of the overall system. The code is an important part of the system, but there’s a lot more to our system than just code.

  • State and Services
    Production builds up state, in databases, caches, and configuration, that is difficult to reproduce faithfully in a synthetic environment.
  • Input in Production
    The only way to truly build confidence in the system at hand is to experiment with the actual input received by the production environment.
  • Other People’s Systems
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
—Leslie Lamport

The behavior of other people’s systems will always differ between production and synthetic environments. This reinforces the fact that you want to run experiments in production, the only place where you will have an authentic interaction with those other systems.

  • Agents Making Changes
  • External Validity
  • Get as Close as You Can
    Even if you cannot run directly in production, the closer your experimental environment is to production, the fewer threats to external validity your experiment will have, and the more confidence you can have in the results.

Automate Experiments to Run Continuously

Automatically Executing Experiments

Ideally, experiments would run with each change, kind of like a Chaos Canary. When a new risk is discovered, the operator can choose whether or not they should block the roll out of the change and prioritize a fix, being reasonably sure that the rolled out change is the cause. This approach provides insight into the onset and duration of availability risks in production. At the other extreme, annual exercises lead to more difficult investigations that essentially start from scratch and don’t provide easy insight into how long the potential issue has been in production.
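
A minimal sketch of that "Chaos Canary" flow, where `run_experiment` and `deploy` are hypothetical stand-ins for whatever pipeline and chaos tooling are actually in place.

```python
def run_experiment(change_id: str) -> bool:
    """Placeholder: inject the standard fault set and check steady state."""
    print(f"running chaos experiment against change {change_id}")
    return True  # pretend the steady-state hypothesis held

def deploy(change_id: str) -> None:
    """Gate the roll-out of each change on a passing chaos experiment."""
    if run_experiment(change_id):
        print(f"rolling out {change_id}")
    else:
        # The operator can be reasonably sure this change introduced the risk.
        print(f"blocking {change_id}; prioritize a fix")

if __name__ == "__main__":
    deploy("change-1234")  # hypothetical change identifier
```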

Automatically Creating Experiments

The challenge of designing Chaos Engineering experiments is not identifying what causes production to break, since the data in our incident tracker has that information. What we really want to do is identify the events that shouldn’t cause production to break, and that have never before caused production to break, and continuously design experiments that verify that this is still the case.

Example: Lineage Driven Fault Injection (LDFI)

Minimize Blast Radius

  • The lowest-risk experiments involve few users. To accomplish this, we inject failures that verify client-device functionality for a subset or small group of devices.
  • The next step is to run small-scale diffuse experiments. An experiment of this style impacts a small percentage of traffic and allows the traffic to follow normal routing rules so it ends up evenly distributed throughout the production servers.
  • The next step is to run small-scale concentrated experiments, overriding the routing of requests for all users in this experiment to direct traffic to specific boxes.
  • The most risky and accurate experiment is large-scale without custom routing.

It is imperative to be able to abort in-process experiments when they cause too much pain. Automated termination is highly recommended, particularly if experiments are running continuously in accordance with the other advanced principles.
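
A minimal sketch that combines the staged expansion above with an automated abort; the traffic stages and the `steady_state_ok` placeholder are illustrative assumptions.

```python
STAGES = [0.001, 0.01, 0.05, 0.25]  # fraction of traffic in the experiment

def steady_state_ok() -> bool:
    """Placeholder for a real check against the monitoring system."""
    return True

def run_with_expanding_blast_radius() -> None:
    """Widen the blast radius stage by stage, aborting on any violation."""
    for fraction in STAGES:
        print(f"injecting faults into {fraction:.1%} of traffic")
        if not steady_state_ok():
            print("steady state violated: aborting experiment")
            return
    print("experiment completed at full scope")

if __name__ == "__main__":
    run_with_expanding_blast_radius()
```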


URL

Chaos Engineering Book
Principles of Chaos Engineering

Tags: Paper Review, Chaos Engineering
