Short Introduction to This Paper

This paper proposes a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection. A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome. The authors present MOLLY, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing. If fault-tolerance bugs exist for a particular configuration, MOLLY finds them rapidly, in many cases using an order of magnitude fewer executions than random fault injection. Otherwise, MOLLY certifies that the code is bug-free for that configuration.

Highlights of This Paper

  • A novel top-down strategy for discovering bugs in distributed data management systems: lineage-driven fault injection (LDFI)
  • MOLLY, an implementation of LDFI. Like fault injection, MOLLY finds bugs in large-scale, complex distributed systems quickly, in many cases using an order of magnitude fewer executions than a random fault injector

Key Infomation

1.png

Figure 1: The MOLLY system performs forward/backward alternation: forward steps simulate concrete distributed executions with failures to produce lineage-enriched outputs, while backward steps use lineage to search for failure combinations that could falsify the outputs. MOLLY alternates between forward and backward steps until it has exhausted the possible falsifiers.

2.png

  • LDFI is inspired by the database literature notion of data lineage, which allows it to directly connect system outcomes to the data and messages that led to them. LDFI uses data lineage to reason backwards (from effects to causes) about whether a given correct outcome could have failed to occur due to some combination of faults. Rather than generating faults at random (or using application-specific heuristics), a lineage-driven fault injector chooses only those failures that could have affected a known good outcome, exercising fault-tolerance code at increasing levels of complexity. Injecting faults in this targeted way allows LDFI to provide completeness guarantees like those achievable with formal methods such as model checking, which have typically been used to verify small protocols in a bottom-up fashion. When bugs are encountered, LDFI’s topdown approach provides—in addition to a counterexample trace fine-grained data lineage visualizations to help programmers understand the root cause of the bad outcome and consider possible remediation strategies
  • Like formal methods, MOLLY finds all of the bugs that could be triggered by failures: when a MOLLY execution completes without counterexamples it certifies that no fault-tolerance bugs exist for a given configuration. MOLLY integrates naturally with root-cause debugging by converting counterexamples into data lineage visualizations.
  • MOLLY quickly identifies 7 critical bugs in 14 fault-tolerant systems, for the remaining 7 systems, it provides a guarantee that no invariant violations exist up to a bounded execution depth, an assurance that state-of-the-art fault injectors cannot provide

Limitations

  • While MOLLY can be used to find bugs in fundamentally non-deterministic protocols like anti-entropy or randomized consensus, certifying such protocols as bug-free will require additional research
  • An abstraction that factors apart partial failure and asynchrony is fundamentally incomplete, because these algorithms are required to (attempt to) distinguish between delay and failure

Relevant Future Works

  • Verifying recovery protocols by modeling crash-recovery failures
  • Many of the why-not provenance techniques discussed in the paper could assist in implementing LDFI
  • Relaxing the synchronicity abstraction
  • Explore using the backwards reasoning approach of LDFI to assist in fault-tolerant program synthesis

URL

Lineage-driven Fault Injection

Short Introduction to This Paper

This paper aims at analyzing and improving how software handles unanticipated exceptions. The first objective is to set up contracts about exception handling and a way to assess them automatically. The second one is to improve the resilience capabilities of software by transforming the source code. The authors devise an algorithm, called short-circuit testing, which injects exceptions during test suite execution so as to simulate unanticipated errors. It is a kind of fault-injection techniques dedicated to exceptionhandling. This algorithm collects data that is used for verifying two formal contracts that capture two resilience properties w.r.t. exceptions: the source-independence and pure-resilience contracts. Then the team propose a code modification technique, called “catch-stretching” which allows error-recovery code (of the form of catch blocks) to be more resilient.

Highlights of This Paper

  • This work shows that it is possible to reason on software resilience by injecting exceptions during test suite execution
  • Definition of two contracts for exception handling: source independence contract, pure resilience contract
  • An algorithm and four predicates to verify whether a try-catch satisfies those contracts
  • A source code transformation to improve the resilience against exceptions
  • An empirical evaluation on 9 open sources applications with one test suite each showing that there exists resilient try-catch blocks in practice

Key Infomation

1.png

2.png

  • Source-independent: A try-catch is source-independent if the catch block proceeds equivalently, whatever the source of the caught exception is in the try block
  • Pure Resilience: A try-catch is purely resilient if the system state is equivalent at the end of the try-catch execution whether or not an exception occurs in the try block
  • Short-circuit testing consists of dynamically injecting exceptions during the test suite execution in order to analyze the resilience of try-catch blocks
  • Catch Stretching: Replacing the type of the caught exceptions so that they catch more exceptions than before. For instance, replacing catch(FileNotFoundException e) by catch(IOExceptione). The extreme of catch stretching is to parametrize the catch with the most generic type of exceptions(e.g. Throwable in Java, Exception in .NET)

Relevant Future Works

  • Further exploring how to improve the resilience of software applications: the scope of try blocks can be automatically adapted while still satisfying the test suite
  • The purely resilient catch blocks could probably be used elsewhere because they have a real recovery power
  • The resilience oracle has not to be only a test suite, but for example metamorphic relations or production traces
  • Automated refactoring of the relevant test suite

Questions

  • How to do catch stretching when there is a try with multiple catch blocks? And maybe the original test suites are not enough to verify new catch blocks

URL

Exception Handling Analysis and Transformation Using Fault Injection - Study of Resilience Against Unanticipated Exceptions

Short Introduction to This Paper

Many perturbations do not break the correctness in programs. This phenomenon is called "correctness attraction". This paper devises a novel protocol, which considers a systematic exploration of the perturbation space as well as perfect oracles to determine the correctness of the output. To this extent, a qualitative manual analysis enables it to set up the first taxonomy ever of the reasons behind correctness attraction.

Highlights of This Paper

  • A protocol called ATTRACT to study the stability of programs under perturbation
  • Explored the perturbability of 10 subjects for two perturbation models, PONE for integers and PBOOL for booleans.
  • Original taxonomy of the causes behind correctness attraction: natural randomization, potential aternative executions, fixed point effect, extra resources, relaxed problem, nullified perturbation, overfit to input data

Key Infomation

  • An execution perturbation is a runtime change of the value of one variable in a statement or an expression. An execution perturbation has 3 characteristics: time: when the change occurs (e.g. at the second and the fourth iteration of a loop condition), location: where in the code (e.g. on variable ‘i’ at line 42) and perturbation model: what is this change, according to the type of the location (e.g. +1 on an integer value)
  • The perturbation space for an input is composed of of all possible unique perturbed executions according to a perturbation model
  • Correctness attraction is the phenomenon by which the correctness of an output is not impacted by execution perturbation. Correctness attraction means that one can perturb an execution while keeping the output correct according to a perfect oracle

TIM截图20180111155334.png

Relevant Future Works

  • The taxonomy is probably not complete since the benchmark does not reflect the diversity of software

URL

Correctness Attraction - A Study of Stability of Software Behavior Under Runtime Perturbation

P40718-200440.jpg

UNADJUSTEDNONRAW_thumb_1783.jpg

工位的变化,第一张应该是2013~2014年那会儿在朗科的时候,第二张是2017年搬家到飞亚达的时候。多了各种装备,爽得飞起~!

706A0001.JPG

2014年我们在北京召开第二届犀牛鸟创新人才培养研讨会,很久没有主持会议了,那次还真有点紧张,现场照着稿子读的比较快,也不敢抬头。记得开场之后刚哥和飞哥还分别鼓励我,说主持的挺好,语速再慢些、从容一些就更好了\~\~后来主持的场合多了,也就更自然了,还是要不断练习呀\~\~