Part I Introduction

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
-- Principles of Chaos

Using Chaos Engineering may be as simple as manually running kill -9 on a box inside your staging environment to simulate the failure of a service. Or it can be as sophisticated as automatically designing and carrying out experiments in a production environment against a small but statistically significant fraction of live traffic.

The History of Chaos Engineering at Netflix: started in 2008

  • Chaos Monkey: got the ball rolling, gaining notoriety for turning off services in the production environment
  • Chaos Kong: transferred those benefits from the small scale to the very large
  • Failure Injection Testing (FIT): the foundation for tackling the space in between

- Read the full article -

Short Introduction to This Paper

This paper introduces and explores the idea of data poisoning, a lightweight peer-architecture technique for injecting faults into Python programs. The method requires only small modifications to the original program, which facilitates evaluating the sensitivity of systems that are prototyped or modeled in Python. The paper does not show much detail about the implementation, but the types of data poisoning it defines are very interesting.

Highlights of This Paper

  • Data poisoning's symbolic expression
  • Different types of data poisoning

Key Information

  • Types of data poisoning: deterministic-effect poisoning, intermittent-effect poisoning (which requires defining the lifetime of the poisoned data), and infectious vs. non-infectious poisoning (see the sketch below)
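
The categories above can be illustrated with a minimal Python sketch. The names (Poisoned, corrupt, infectious_add) and the numeric perturbation are illustrative assumptions, not taken from the paper:

    import random

    class Poisoned:
        """Wrapper marking a value as poisoned (names are illustrative only)."""
        def __init__(self, value, lifetime=None):
            self.value = value
            self.lifetime = lifetime  # None => deterministic (permanent) effect

        def read(self):
            # Intermittent-effect poisoning: corruption only lasts `lifetime` reads.
            if self.lifetime is not None:
                self.lifetime -= 1
                if self.lifetime < 0:
                    return self.value       # the poison's lifetime has expired
            return corrupt(self.value)      # deterministic effect while active

    def corrupt(value):
        # One possible soft fault: perturb a numeric value by up to +/-10%.
        return value * (1 + random.uniform(-0.1, 0.1))

    def infectious_add(a, b):
        # Infectious poisoning: a result derived from a poisoned operand is itself
        # poisoned; non-infectious poisoning would just return the plain number.
        av = a.read() if isinstance(a, Poisoned) else a
        bv = b.read() if isinstance(b, Poisoned) else b
        if isinstance(a, Poisoned) or isinstance(b, Poisoned):
            return Poisoned(av + bv)
        return av + bv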

Relevant Future Works

  • Data poisoning alone is not enough; the system's behaviour under different types of perturbation should also be analysed

URL

Data Poisoning: Lightweight Soft Fault Injection for Python

Short Introduction to This Paper

This paper describes the motivation, innovation, design, running example, and future development of a Fault Injection Tool (FIT). The tool enables the controlled injection of cloud-platform faults such as resource stress and service or VM outages, so that their effect on deployed applications can be observed.

Highlights of This Paper

  • The DICE FIT addresses the need to generate various cloud-agnostic faults at the VM Admin and Cloud Admin levels, offering greater flexibility, the ability to generate multiple faults, and a relatively lightweight footprint

Key Information

  • Design: to access the VM level, the DICE FIT connects to the Virtual Machines over SSH and issues commands. By using JSCH, the tool is able to connect to any VM that has SSH enabled and issue commands as a pre-defined user, which allows greater flexibility of commands as well as the installation of tools and dependencies (a rough sketch of this design follows below)
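
The DICE FIT itself does this through JSCH from Java; as a rough, hypothetical illustration of the same design, the sketch below uses Python's paramiko to connect to a VM as a pre-defined user and start a CPU-stress fault. The host, user, key file, and stress-ng invocation are placeholders, not taken from the DICE implementation.

    import paramiko

    def inject_cpu_stress(host, user, key_file, seconds=60):
        """SSH into a VM and start a CPU stress fault (illustrative sketch only)."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user, key_filename=key_file)
        try:
            # Any shell command can be issued this way, including installing the
            # tools and dependencies a fault needs (DICE FIT does this via JSCH).
            _, stdout, _ = client.exec_command(
                f"stress-ng --cpu 0 --timeout {seconds}s")
            return stdout.channel.recv_exit_status()  # block until the fault ends
        finally:
            client.close()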

Relevant Future Works

  • Containerised environments will also be considered as future FIT targets, to help understand the effect on microservices of injecting faults into the underlying host, as well as the integrity of the containerised deployment
  • The CACTOS project will expand the tool's functionality by initiating a specific application-level fault to trigger optimisation algorithms

URL

DICE Fault Injection Tool (Paper)
DICE-Fault-Injection-Tool (GitHub Project)

Short Introduction to This Paper

This paper describes how the authors adapted and implemented a research prototype called lineage-driven fault injection (LDFI; see the separate review of that paper) to automate failure testing at Netflix. Along the way, they describe the challenges that arose in adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. They show how they implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present preliminary results.

Highlights of This Paper

  • The way they adapted LDFI to automate failure testing at Netflix is worth studying, including how they define request classes, learn mappings, and replay requests

Key Information

  • More explanation of LDFI: it begins with a correct outcome and asks why the system produced it. This recursive process of asking why questions yields a lineage graph that characterizes all of the computations and data that contributed to the outcome. In doing so, it reveals the system’s implicit redundancy by capturing the various alternative computations that could have produced the good result.
  • Solutions to the SAT problem: a solution does not necessarily indicate a bug, but rather a hypothesis that must be tested via fault injection
  • Implementation

    • Training: the service collects production traces from the tracing infrastructure and uses them to build a classifier that determines, given a user request, which request class it belongs to. The authors found that the most predictive features include the service URI, device type, and a variety of query-string parameters, including parameters passed to the Falcor data platform
    • Model enrichment: the service also uses production traces generated by experiments (fault injection exercises that test prior hypotheses produced by LDFI) to update its internal model of alternatives. Intuitively, if an experiment failed to produce a user-visible error, then the call graph generated by that execution is evidence of an alternative computation not yet represented in the model, so it must be added. Doing so effectively prunes the space of future hypotheses
    • Experiments: finally, the service occasionally "installs" a new set of experiments on Zuul. This requires providing Zuul with an up-to-date classifier and the current set of failure hypotheses for each active request class. Zuul then (for a small fraction of user requests) consults the classifier and appropriately decorates user requests with fault metadata (see the sketches after this list)
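
To make the LDFI loop above concrete, here is a toy Python sketch (not the Netflix implementation): each known alternative computation from the lineage graph is reduced to a set of services, a fault hypothesis is a minimal set of services that cuts every known alternative, and a passed experiment enriches the model with the newly observed alternative, which prunes future hypotheses.

    from itertools import combinations

    def hypotheses(alternatives, max_faults=2):
        """Minimal fault sets that break every known alternative computation.

        `alternatives` is a list of sets of service names; each set is one known
        way of producing the good outcome. A returned fault set is only a
        hypothesis: it must still be tested via fault injection.
        """
        services = sorted(set().union(*alternatives))
        found = []
        for size in range(1, max_faults + 1):
            for faults in combinations(services, size):
                fault_set = set(faults)
                if any(set(f) <= fault_set for f in found):
                    continue  # non-minimal: contains an already-found hypothesis
                if all(fault_set & alt for alt in alternatives):
                    found.append(faults)
        return found

    def enrich(alternatives, observed_call_graph):
        """A passed experiment reveals redundancy: record the surviving alternative."""
        if observed_call_graph not in alternatives:
            alternatives.append(observed_call_graph)
        return alternatives

    # Two known ways to serve a request (hypothetical service names).
    alts = [{"api", "ratings"}, {"api", "ratings-fallback"}]
    print(hypotheses(alts))  # [('api',), ('ratings', 'ratings-fallback')]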
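
The training and experiment steps can be sketched in the same toy style: a classifier built from the most predictive trace features assigns each request a class, and for a small fraction of requests in an active class the edge proxy decorates the request with fault metadata. The feature rules, header name, sampling rate, and class names below are hypothetical; in the real system Zuul and the tracing infrastructure do this work.

    import random

    # Hypothetical model state: request class -> failure hypotheses still to test.
    ACTIVE_HYPOTHESES = {
        "play-manifest": [("ratings",), ("ratings", "ratings-fallback")],
    }
    SAMPLE_RATE = 0.01  # only a small fraction of live requests carry faults

    def classify(request):
        """Assign a request class from predictive features such as the service
        URI, device type, and query-string parameters."""
        # A trained single-label classifier would sit here; a rule keeps it short.
        if request["uri"].startswith("/play") and request["device_type"] == "tv":
            return "play-manifest"
        return "other"

    def decorate(request):
        """Attach fault metadata to a sampled request, as the edge proxy would."""
        hyps = ACTIVE_HYPOTHESES.get(classify(request))
        if hyps and random.random() < SAMPLE_RATE:
            request["headers"]["x-fault-hypothesis"] = ",".join(random.choice(hyps))
        return request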

Relevant Future Works

  • They used the single-label classifier for the first release of the LDFI service, but are continuing to investigate the multi-label formulation

URL

Automating Failure Testing Research at Internet Scale