Posts tagged with Lineage-driven

Short Introduction to This Paper

This paper describes how to adapt and implement a research prototype called lineage-driven fault injection (LDFI, another paper view see here) to automate failure testing at Netflix. Along the way, the authors describe the challenges that arose adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. They show how they implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present preliminary results.

Highlights of This Paper

  • The way of how they adapted LDFI to automate failure testing at Netflix is worth learning, including defining request classes, learning mappings, replay mechanisms.

Key Infomation

  • More explanations about LDFI: It begins with a correct outcome, and asks why the system produced it. This recursive process of asking why questions yields a lineage graph that characterizes all of the computations and data that contributed to the outcome. By doing so, it reveals the system’s implicit redundancy by capturing the various alternative computations that could produce the good result.
  • About the solutions in the SAT problem: It does not necessarily indicate a bug, but rather a hypothesis that must be tested via fault injection
  • Implementation

    • Training: the service collects production traces from the tracing infrastructure and uses them to build a classifier that determines, given a user request, to which request class it belongs. We found that the most predictive features include service URI, device type, and a variety of query string parameters including parameters passed to the Falcor data platform
    • Model enrichment: the service also uses production traces generated by experiments (fault injection exercises that test prior hypotheses produced by LDFI) to update its internal model of alternatives. Intuitively, if an experiment failed to produce a user-visible error, then the call graph generated by that execution is evidence of an alternative computation not yet represented in the model, so it must be added. Doing so will effectively prune the space of future hypotheses
    • Experiments: finally, the service occasionally "installs" a new set of experiments on Zuul. This requires providing Zuul with an up-to-date classifier and the current set of failure hypotheses for each active request class. Zuul will then (for a small fraction of user requests) consult the classifier and appropriately decorate user requests with fault metadata

Relevant Future Works

  • They used the single-label classifier for the first release of the LDFI service, but are continuing to investigate the multi-label formulation

URL

Automating Failure Testing Research at Internet Scale

Short Introduction to This Paper

This paper proposes a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection. A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome. The authors present MOLLY, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing. If fault-tolerance bugs exist for a particular configuration, MOLLY finds them rapidly, in many cases using an order of magnitude fewer executions than random fault injection. Otherwise, MOLLY certifies that the code is bug-free for that configuration.

Highlights of This Paper

  • A novel top-down strategy for discovering bugs in distributed data management systems: lineage-driven fault injection (LDFI)
  • MOLLY, an implementation of LDFI. Like fault injection, MOLLY finds bugs in large-scale, complex distributed systems quickly, in many cases using an order of magnitude fewer executions than a random fault injector

Key Infomation

1.png

Figure 1: The MOLLY system performs forward/backward alternation: forward steps simulate concrete distributed executions with failures to produce lineage-enriched outputs, while backward steps use lineage to search for failure combinations that could falsify the outputs. MOLLY alternates between forward and backward steps until it has exhausted the possible falsifiers.

2.png

  • LDFI is inspired by the database literature notion of data lineage, which allows it to directly connect system outcomes to the data and messages that led to them. LDFI uses data lineage to reason backwards (from effects to causes) about whether a given correct outcome could have failed to occur due to some combination of faults. Rather than generating faults at random (or using application-specific heuristics), a lineage-driven fault injector chooses only those failures that could have affected a known good outcome, exercising fault-tolerance code at increasing levels of complexity. Injecting faults in this targeted way allows LDFI to provide completeness guarantees like those achievable with formal methods such as model checking, which have typically been used to verify small protocols in a bottom-up fashion. When bugs are encountered, LDFI’s topdown approach provides—in addition to a counterexample trace fine-grained data lineage visualizations to help programmers understand the root cause of the bad outcome and consider possible remediation strategies
  • Like formal methods, MOLLY finds all of the bugs that could be triggered by failures: when a MOLLY execution completes without counterexamples it certifies that no fault-tolerance bugs exist for a given configuration. MOLLY integrates naturally with root-cause debugging by converting counterexamples into data lineage visualizations.
  • MOLLY quickly identifies 7 critical bugs in 14 fault-tolerant systems, for the remaining 7 systems, it provides a guarantee that no invariant violations exist up to a bounded execution depth, an assurance that state-of-the-art fault injectors cannot provide

Limitations

  • While MOLLY can be used to find bugs in fundamentally non-deterministic protocols like anti-entropy or randomized consensus, certifying such protocols as bug-free will require additional research
  • An abstraction that factors apart partial failure and asynchrony is fundamentally incomplete, because these algorithms are required to (attempt to) distinguish between delay and failure

Relevant Future Works

  • Verifying recovery protocols by modeling crash-recovery failures
  • Many of the why-not provenance techniques discussed in the paper could assist in implementing LDFI
  • Relaxing the synchronicity abstraction
  • Explore using the backwards reasoning approach of LDFI to assist in fault-tolerant program synthesis

URL

Lineage-driven Fault Injection