[Paper Review] Fault Injection in Production - Making the case for resilience testing
Short Introduction to This Paper
This paper introduces how Etsy uses "GameDay" exercises to build confidence in how its systems behave under failure. It covers 1) why fault injection should be applied in the production environment, 2) how to inject faults during a GameDay exercise, 3) the business justification, and 4) a case study, along with the limitations of the approach and the fears surrounding it.
Highlights of This Paper
- An overview of provisioning a server or cloud instance from zero to production
- An explanation of why many complex systems are largely intractable
- The pattern of a GameDay exercise, showing how fault injection is done at a real company
Key Information
The provisioning of a server or cloud instance from zero to production:
- Bare metal (or cloud-compute instance) is made available
- Base operating system is installed via PXE (preboot execution environment) or machine image
- Operating-system-level configurations are put into place (via configuration management or machine image)
- Application-level configurations are put into place (via configuration management, app deployment, or machine image)
- Application code is put into place and underlying services are started correctly (via configuration management, app deployment, or machine image)
- Systems integration takes place in the network (load balancers, VLANs, routing, switching, DNS, etc.)
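The stages above run in a fixed order, and each one is a point where something can go wrong. A minimal sketch of that ordering follows. The stage identifiers and the `run_stage` callback are hypothetical names chosen for illustration and do not come from the paper.

```python
# Hypothetical stage names mirroring the six provisioning steps listed above.
PROVISIONING_STAGES = [
    "hardware_available",
    "base_os_installed",
    "os_config_applied",
    "app_config_applied",
    "app_code_deployed",
    "network_integrated",
]


def provision(run_stage):
    """Run each stage in order and stop at the first failure.

    run_stage is an assumed callback: it takes a stage name and returns
    True on success, False on failure.
    Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in PROVISIONING_STAGES:
        if not run_stage(stage):
            return completed, stage
        completed.append(stage)
    return completed, None
```

Passing a callback that fails on purpose at one stage is a small-scale analogue of injecting a fault: it shows which later stages never run when an earlier one breaks.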
The challenge is that many complex systems are largely intractable, meaning that:
- A full description requires many details, not few
- The rate of change is high; the systems change before a full description (and therefore understanding) can be completed
- How components function is partly unknown, as they resonate with each other across varying conditions
- Processes are heterogeneous and possibly irregular
The pattern of GameDay exercise at Etsy:
- Imagine a possible untoward event in your infrastructure
- Figure out what is needed to prevent that event from affecting your business, and implement that
- Cause the event to happen in production, ultimately to prove the noneffect of the event and gain confidence surrounding it
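The three steps can be illustrated with a toy, in-process sketch. It is not how Etsy runs a GameDay, which happens against real production infrastructure. The `PaymentClient` class and the backend functions are hypothetical and exist only to show the imagine, safeguard, and inject-and-verify sequence.

```python
class PaymentClient:
    """Hypothetical client with a primary and a fallback backend."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def charge(self, amount):
        # Step 2: the safeguard. If the primary is unreachable,
        # fall back so the failure does not affect the caller.
        try:
            return self.primary(amount)
        except ConnectionError:
            return self.fallback(amount)


def healthy_backend(amount):
    return f"charged {amount}"


def dead_backend(amount):
    # Step 1: the imagined untoward event (the backend is unreachable).
    raise ConnectionError("backend unreachable")


def run_gameday():
    # Step 3: cause the event on purpose and check that it has no effect.
    client = PaymentClient(primary=dead_backend, fallback=healthy_backend)
    return client.charge(10) == "charged 10"
```

If `run_gameday()` returns `True`, the injected failure was absorbed by the safeguard. When both backends are dead, `charge` raises `ConnectionError`, which is the kind of gap an exercise is meant to surface.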
Failure types considered in a real case involving Etsy's payments system:
- One of the app servers dies (power cable yanked out)
- All of the app servers leave the load-balancing pool
- One of the app servers gets wiped clean and needs to be fully rebuilt from scratch
- Database dies (power cable yanked out and/or process is killed ungracefully)
- Database is fully corrupt and needs full restore from backup
- Offsite database replica is needed to investigate/restore/replay single transactions
- Connectivity to third-party sites is cut off entirely
Limitations of GameDay exercises:
- First, the exercises aren’t meant to discover how engineering teams handle working under time pressure with escalating and sometimes disorienting scenarios
- The faults and failure modes are contrived. They reflect the fault designer’s imagination and therefore can’t be comprehensive enough to guarantee the system’s safety
Another interesting point
Automated fault injection can carry with it a paradox. If the faults that are injected (even at random) are handled in a transparent and graceful way, then they can go unnoticed. You would think this was the goal: for failures not to matter whatsoever when they occur. This masking of failures, however, can result in the very complacency that they are intended (at least should be intended) to decrease. In other words, when you’ve got randomly generated and/or continual fault injection and recovery happening successfully, care must be taken to raise the detailed awareness that this is happening: when, how, where, etc. Otherwise, the failures themselves become another component that increases complexity in the system while still having limitations to their functionality (because they are still contrived and therefore insufficient).
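One way to keep gracefully handled faults from going unnoticed is to record every injection, including the ones that recover cleanly. The sketch below assumes a hypothetical `FaultLog` and `inject_random_fault` helper. Neither comes from the paper, and the `handler` callback stands in for whatever recovery mechanism the real system has.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FaultLog:
    """Hypothetical record of every injected fault, recovered or not."""
    events: list = field(default_factory=list)

    def record(self, target, fault, recovered):
        self.events.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "where": target,
            "how": fault,
            "recovered": recovered,
        })


def inject_random_fault(targets, faults, handler, log, rng):
    """Pick a random target and fault, run the handler, and log the outcome.

    handler is an assumed callback: handler(target, fault) -> bool,
    returning True if the system recovered.
    """
    target = rng.choice(targets)
    fault = rng.choice(faults)
    recovered = handler(target, fault)
    # Log even when recovery succeeds, so masked failures stay visible.
    log.record(target, fault, recovered)
    return recovered
```

A usage example under the same assumptions:

```python
log = FaultLog()
rng = random.Random(0)
inject_random_fault(["app-1", "db-1"], ["kill_process"], lambda t, f: True, log, rng)
print(len(log.events))  # one entry, even though the fault was handled
```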
URL
Fault Injection in Production - Making the case for resilience testing