Short Introduction to This Paper

This paper describes the Chaos Automation Platform (ChAP), a system for running failure injection experiments on the production system to verify that failures in non-critical services do not result in system outages.

A Concrete Example of a Chaos Experiment

Consider the following scenario: Alice, a (fictional) QA engineer on the Gallery team, wants to verify that Netflix is resilient to failures in the Gallery service. She uses ChAP’s web interface to define an experiment. Because ChAP injects failures on the client side of the request, she selects the API server group as the subject of the experiment. She specifies that all calls to the Gallery service should fail. She chooses to divert only a small amount of traffic for this experiment: 0.3%. She chooses a duration of 30 minutes for the experiment.
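The choices Alice makes in the web interface can be pictured as a small experiment description. The following is only a sketch: the class name `ExperimentSpec` and all of its field names are assumptions made for illustration and are not taken from ChAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """Hypothetical description of one failure injection experiment."""

    subject_group: str      # server group whose outbound calls are affected
    target_service: str     # dependency whose calls should fail
    failure_rate: float     # fraction of calls to the target that fail (1.0 = all)
    traffic_percent: float  # percent of production traffic diverted
    duration_minutes: int   # how long the experiment runs

    def validate(self) -> None:
        """Reject values that cannot describe a meaningful experiment."""
        if not 0.0 < self.failure_rate <= 1.0:
            raise ValueError("failure_rate must be in (0, 1]")
        if not 0.0 < self.traffic_percent <= 100.0:
            raise ValueError("traffic_percent must be in (0, 100]")
        if self.duration_minutes <= 0:
            raise ValueError("duration_minutes must be positive")


# Alice's experiment from the scenario above.
alice_experiment = ExperimentSpec(
    subject_group="api",
    target_service="gallery",
    failure_rate=1.0,
    traffic_percent=0.3,
    duration_minutes=30,
)
alice_experiment.validate()
```

Note that the subject is the API server group rather than Gallery itself, because the failure is injected where the call is made.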

Finally, she selects the metrics that she is interested in observing for the experiment. She chooses a number of Hystrix commands to track. Hystrix is a library that allows engineers to wrap RPC calls and specify what the fallback behavior should be if an RPC call fails. Each Hystrix command has a name, e.g., “GetGallery”.
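To make the idea of a wrapped call with a fallback concrete, here is a minimal Python sketch. It is not the Hystrix API (Hystrix is a Java library and also provides features such as timeouts and circuit breaking, which are left out here); the `Command` class and the outcome labels are assumptions made for illustration.

```python
from typing import Any, Callable, Optional, Tuple


class Command:
    """Hypothetical named wrapper around a remote call with a fallback."""

    def __init__(
        self,
        name: str,
        run: Callable[[], Any],
        fallback: Callable[[], Any],
    ) -> None:
        self.name = name
        self._run = run
        self._fallback = fallback

    def execute(self) -> Tuple[str, Optional[Any]]:
        """Return (outcome, value), trying the fallback if the call fails."""
        try:
            return "success", self._run()
        except Exception:
            pass  # the primary call failed; fall through to the fallback
        try:
            return "fallback_success", self._fallback()
        except Exception:
            return "fallback_failure", None


def failing_gallery_call() -> Any:
    # Stands in for an RPC to Gallery that has a failure injected.
    raise ConnectionError("injected failure: Gallery unavailable")


# With the call failing, the command serves an empty gallery instead.
get_gallery = Command("GetGallery", run=failing_gallery_call, fallback=lambda: [])
outcome, value = get_gallery.execute()
```

The three outcome labels correspond to the three kinds of counts that are tracked per command.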

For each Hystrix command, for the control and experiment server groups, ChAP will display counts of:

  • successful requests served
  • successful fallbacks served
  • failed fallbacks served
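The per-command counts can be sketched as a simple tally over outcome events. This is an illustration only: the event format `(group, command, outcome)` and the function names are assumptions, not how ChAP collects its metrics.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

OUTCOMES = ("success", "fallback_success", "fallback_failure")
GROUPS = ("control", "experiment")

Event = Tuple[str, str, str]  # (server group, command name, outcome)


def tally(events: Iterable[Event]) -> Counter:
    """Count events per (group, command, outcome)."""
    counts: Counter = Counter()
    for group, command, outcome in events:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        counts[(group, command, outcome)] += 1
    return counts


def report(counts: Counter, command: str) -> Dict[str, Dict[str, int]]:
    """Side-by-side counts for one command in both server groups."""
    return {
        group: {o: counts[(group, command, o)] for o in OUTCOMES}
        for group in GROUPS
    }


# Made-up events, purely to show the shape of the report.
sample_events = [
    ("control", "GetGallery", "success"),
    ("control", "GetGallery", "success"),
    ("experiment", "GetGallery", "fallback_success"),
    ("experiment", "GetGallery", "fallback_failure"),
]
gallery_report = report(tally(sample_events), "GetGallery")
```

Comparing the two groups for the same command is what makes the counts useful: a rise in failed fallbacks in the experiment group that is absent from the control group points at the injected failure.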

Figure: A fraction of traffic is routed to the control and experiment server groups. Failures are injected only into the experiment server group.
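One way to picture this routing is a deterministic, hash-based assignment of requests to groups. The sketch below is an assumption made for illustration, not ChAP's actual routing logic: the function names, the use of a request ID, and the choice to give control and experiment equal shares are all hypothetical.

```python
import hashlib

BUCKETS = 1_000_000  # resolution of the traffic split


def assign_group(request_id: str, fraction_per_group: float) -> str:
    """Map a request to 'control', 'experiment', or 'production'.

    fraction_per_group is the share of traffic sent to EACH of the two
    test groups (an assumption of this sketch).
    """
    if not 0.0 <= fraction_per_group <= 0.5:
        raise ValueError("fraction_per_group must be in [0, 0.5]")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    cutoff = round(fraction_per_group * BUCKETS)
    if bucket < cutoff:
        return "control"
    if bucket < 2 * cutoff:
        return "experiment"
    return "production"


def should_inject_failure(group: str, called_service: str, target_service: str) -> bool:
    """Fail a call only for the experiment group and only for the target service."""
    return group == "experiment" and called_service == target_service
```

Hashing the request ID keeps the assignment stable, so the same request always lands in the same group, and the control group sees the same routing change as the experiment group but no injected failures.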

Further Reading

A personalized list of recommendations and bookmarks that recall where you left off when previously watching a video add value to the user, but if the services that implement these features stop working, we should still be able to provide a reasonable user experience. Hodges describes this kind of graceful degradation as partial availability.

J. Hodges, “Notes on distributed systems for young bloods,” Something Similar blog, January 14, 2013. https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

URL

A Platform for Automating Chaos Experiments

Tag: Paper Review, Chaos Engineering
