Sample your production deployment artifacts, topology, and traffic to construct a plan for building complete or partial variants of your system and the communication paths between your services.
Classify requests and replay them deterministically across hundreds of simulated variant environments running under controlled failure conditions. Our algorithms reduce the problem space and learn over time to zero in on the more vulnerable and critical parts of your system.
Ongoing reports help you understand the safety of your deployment and uncover critical components, unexpected or emergent behavior, and coupling between services that you didn't know existed. You can prioritize where to improve the resilience of your system and monitor whether it is getting better or worse over time.
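To make the replay step above more concrete, here is a minimal sketch of deterministic replay under controlled failure conditions. It is an illustration only, not the product's actual algorithm: the `Request` type, the `replay` and `run_variants` functions, the variant names, and the failure rates are all hypothetical, and fault injection is reduced to a seeded coin flip per request.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    service: str  # hypothetical name of the service that handles the request


def replay(requests, failure_rate, seed):
    """Replay requests against one simulated variant, injecting faults at a fixed rate."""
    rng = random.Random(seed)  # seeded so every run injects the same faults
    outcomes = {}
    for req in requests:
        injected = rng.random() < failure_rate
        outcomes[req.request_id] = "failed" if injected else "ok"
    return outcomes


def run_variants(requests, variants, seed=0):
    """Run the same captured traffic through each variant's failure condition."""
    return {
        name: replay(requests, rate, seed=seed + index)
        for index, (name, rate) in enumerate(sorted(variants.items()))
    }


# Hypothetical captured traffic and variant environments (assumed values).
captured = [Request(f"req-{i}", "checkout") for i in range(100)]
variants = {"baseline": 0.0, "degraded-network": 0.2, "dependency-outage": 1.0}
results = run_variants(captured, variants)
```

Because the random generator is seeded, running `run_variants` again with the same inputs produces identical outcomes, so a failure found in one run can be reproduced in the next.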
Build high-availability cloud native applications
Simulate how cloud applications can fail by running continuous experiments, and discover which work leads to increased availability.
Resilience, safety, confidence
As systems get more complex, failure is inevitable.
In distributed architectures, the unknown unknowns of how one component's behavior can cause other parts of the system to fail become difficult to reason about without observing the components interact.
With applications running on managed infrastructure and partially composed of black-box third-party APIs, the availability of your system is no longer exclusively within your control. You have to shift your focus on failure tolerance to application behavior.
Thinking about and practicing chaos engineering is becoming essential to operating a cloud-native system safely.
Having a way to observe how your application behaves under failure conditions helps you qualify the level of risk you are willing to accept in your system, and prioritize which "technical debt" to pay off and where NOT to.
Proactively experimenting with failure and observing how components respond when chaos strikes helps you understand the inherent risk in your system, so you can ship changes confidently and safely and experience fewer incidents by:
- Observing emergent behavior
- Detecting complex faults
- Surfacing unknown or transitive dependencies
- Promoting isolation of concerns
- Identifying acceptable error rates and testing how behavior degrades gracefully
- Observing peripheral aspects of the system
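
As one illustration of surfacing transitive dependencies from the list above, the sketch below walks a service call graph to find everything a service depends on, directly or indirectly. The graph, the service names, and the `transitive_dependencies` helper are hypothetical; in practice the graph would have to come from observed traffic or topology data.

```python
from collections import deque


def transitive_dependencies(graph, service):
    """Return every service that `service` depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(service, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, ()))
    return seen


# Hypothetical call graph: each key calls the services in its list.
call_graph = {
    "checkout": ["cart", "payments"],
    "payments": ["fraud-check"],
    "fraud-check": ["geo-ip"],
}

direct = set(call_graph["checkout"])
hidden = transitive_dependencies(call_graph, "checkout") - direct
```

Here `hidden` holds the services that `checkout` relies on without calling them directly, which are the ones most likely to be overlooked when reasoning about failure.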