A Brief Introduction to The ChaosMachine for Verification and Analysis of Error-handling in the JVM (updated)
When we build applications, one of our aims should be to make them resilient. A good application can sustain its operations in the face of different kinds of failure. The real test of this does not begin until the application is deployed to production, where we can predict neither the failures it will meet nor how it will respond. A newer approach changes our perspective on errors in software systems: instead of only trying to prevent them, we trigger faults in a controlled situation, learn from how the application behaves, and then improve its resilience. To this end, we are designing a chaos agent, ChaosMachine, whose first version focuses on verification and analysis of error-handling in the JVM.
About Chaos Engineering and Antifragile Software
If you are not familiar with chaos engineering, we provide introductory materials about this technique at the end of this article. Chaos engineering is the practice of experimenting on a distributed system in order to build confidence in the system’s capability to withstand unexpected conditions in production. Antifragility is the antonym of fragility. Traditional means of combating fragility include fault prevention, fault tolerance, fault removal, and fault forecasting. These techniques alone are insufficient, however, so we propose another perspective on system errors. If we build mechanisms that let the system experience errors in a controlled environment and learn from those failures, we can build confidence in the system's resilience. The goal of chaos engineering and antifragile design is to perform these perturbations and learn from the experience.
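To make the idea of a controlled perturbation concrete, here is a minimal Java sketch of one experiment: an exception is thrown on purpose at the start of a try block, and we observe whether the catch block keeps the application usable. This is only an illustration under stated assumptions, not ChaosMachine's actual implementation. The names FaultInjector, ConfigLoader, and survivesInjectedFault are hypothetical, and the injection is done with an explicit method call rather than any real instrumentation mechanism.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class FaultInjectionSketch {

    /** Hypothetical injector: throws only at injection points that are armed. */
    static final class FaultInjector {
        private final Set<String> armed = new HashSet<>();

        void arm(String point) { armed.add(point); }

        void disarm(String point) { armed.remove(point); }

        // Called at the start of a try block under test.
        void maybeThrow(String point) throws IOException {
            if (armed.contains(point)) {
                throw new IOException("injected fault at " + point);
            }
        }
    }

    /** Toy application code (an assumption) with one try-catch block to verify. */
    static final class ConfigLoader {
        private final FaultInjector injector;

        ConfigLoader(FaultInjector injector) { this.injector = injector; }

        String load() {
            try {
                injector.maybeThrow("ConfigLoader.load");
                return "remote-config";
            } catch (IOException e) {
                // Error handling under test: fall back to a default value.
                return "default-config";
            }
        }
    }

    /** Runs one experiment: arm the fault, call the code, check the hypothesis. */
    static boolean survivesInjectedFault(FaultInjector injector, ConfigLoader loader) {
        injector.arm("ConfigLoader.load");
        try {
            String result = loader.load();
            // Hypothesis: the loader still returns a usable configuration.
            return result != null && !result.isEmpty();
        } catch (RuntimeException e) {
            // The fault escaped the catch block, so the hypothesis is falsified.
            return false;
        } finally {
            // Keep the perturbation contained to this single experiment.
            injector.disarm("ConfigLoader.load");
        }
    }

    public static void main(String[] args) {
        FaultInjector injector = new FaultInjector();
        ConfigLoader loader = new ConfigLoader(injector);
        System.out.println("resilient: " + survivesInjectedFault(injector, loader));
    }
}
```

The design choice worth noting is that the fault is armed for a single injection point and disarmed in a finally block, so the perturbation stays controlled and the rest of the application runs unmodified.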