Chaos Engineering

The practice of intentionally injecting failures into systems to build resilience.

"Breaking Things on Purpose"

Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions.

The Scientific Method

Chaos engineering is not just "breaking stuff." Each experiment follows a process:

  1. Hypothesis: "If we kill the checkout service, the site will still serve the homepage."
  2. Experiment: Kill the checkout service.
  3. Observation: Did the homepage load? Or did the whole site crash?
  4. Learning: If the hypothesis did not hold, fix the dependency that caused the failure, then rerun the experiment.
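
The four steps above can be sketched as a small experiment runner. This is only a minimal illustration, not a real tool: `Experiment`, `run_experiment`, and the `ToySite` stand-in are hypothetical names invented here, and the "site" is an in-memory object rather than a real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    hypothesis: str
    inject: Callable[[], None]         # step 2: introduce the failure
    rollback: Callable[[], None]       # undo the failure afterwards
    steady_state: Callable[[], bool]   # step 3: True if the system still behaves acceptably


def run_experiment(exp: Experiment) -> str:
    """Return 'aborted', 'confirmed', or 'refuted' for the hypothesis."""
    # Do not inject anything if the system is already unhealthy.
    if not exp.steady_state():
        return "aborted"
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        # Always restore the system, even if the check itself raised.
        exp.rollback()
    return "confirmed" if held else "refuted"


class ToySite:
    """In-memory stand-in for a site (an assumption made for this sketch)."""

    def __init__(self, homepage_needs_checkout: bool) -> None:
        self.checkout_up = True
        self.homepage_needs_checkout = homepage_needs_checkout

    def kill_checkout(self) -> None:
        self.checkout_up = False

    def restore_checkout(self) -> None:
        self.checkout_up = True

    def homepage_loads(self) -> bool:
        # The homepage fails only if it depends on a checkout service that is down.
        return self.checkout_up or not self.homepage_needs_checkout


site = ToySite(homepage_needs_checkout=True)
experiment = Experiment(
    hypothesis="If we kill the checkout service, the site will still serve the homepage.",
    inject=site.kill_checkout,
    rollback=site.restore_checkout,
    steady_state=site.homepage_loads,
)
result = run_experiment(experiment)
print(result)  # prints "refuted": this toy homepage has a hidden dependency on checkout
```

A "refuted" result is the useful outcome here: it points at the dependency to fix before the experiment is run again.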

Principles

  • Minimize Blast Radius: Don't take down the whole site. Start with 1% of users.
  • Stop Button: Always have a big red button to stop the experiment instantly.
  • Production: Staging is not Production. Eventually, you must test in Prod.
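
The first two principles can be combined in a single guard that decides whether a given request is allowed to receive an injected fault. The sketch below is one possible approach, assuming users are identified by a string ID; `stop_button`, `in_blast_radius`, and `should_inject_fault` are hypothetical names, not part of any chaos tool.

```python
import hashlib
import threading

# The "big red button": once set, no further faults are injected.
stop_button = threading.Event()


def in_blast_radius(user_id: str, percent: float = 1.0) -> bool:
    """Deterministically place roughly `percent` of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to one of 10,000 buckets so percentages resolve to 0.01%.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def should_inject_fault(user_id: str, percent: float = 1.0) -> bool:
    """Inject only for users inside the blast radius, and never after a stop."""
    if stop_button.is_set():
        return False
    return in_blast_radius(user_id, percent)


affected = sum(should_inject_fault(f"user-{i}") for i in range(100_000))
print(f"{affected} of 100000 simulated users would see the fault")
```

Hashing the user ID, rather than picking users at random per request, keeps the same small group in the experiment for its whole duration, which makes the impact easier to observe. Calling `stop_button.set()` disables injection for everyone at once.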

Example: Chaos Monkey

"Netflix created Chaos Monkey to randomly kill servers in production."

Impact
Engineers ignored it at first, causing outages.
Resolution
Teams were forced to architect their services to be stateless and auto-healing. Now, losing a server is a non-event.

Why Chaos Engineering Matters

Systems will fail. It is better to trigger failures on your own terms, with engineers watching, than to discover them during a real incident.

Chaos engineering exposes weaknesses before customers do.

Common Pitfalls

Chaos without Monitoring
Injecting failure when you are blind is just arson. You need Observability first.

How to Use Chaos Engineering

  • Start Small: Test in staging first.
  • Measure Impact: Have clear success/failure criteria.
  • Run Regularly: Monthly game days keep teams sharp.
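
"Clear success/failure criteria" can be written down as explicit thresholds before the experiment starts. The following is a minimal sketch under assumed metrics (error rate and 99th-percentile latency); `Criteria`, `evaluate`, and the threshold values are hypothetical and would need to match whatever your monitoring actually records.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Criteria:
    max_error_rate: float       # e.g. 0.01 means at most 1% failed requests
    max_p99_latency_ms: float   # upper bound on 99th-percentile latency


def evaluate(observations: List[Tuple[float, bool]], criteria: Criteria) -> str:
    """Judge (latency_ms, succeeded) pairs recorded during an experiment."""
    # Without data there is nothing to judge: no monitoring, no verdict.
    if not observations:
        return "inconclusive"
    n = len(observations)
    errors = sum(1 for _, ok in observations if not ok)
    latencies = sorted(latency for latency, _ in observations)
    # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
    rank = (99 * n + 99) // 100
    p99 = latencies[rank - 1]
    if errors / n > criteria.max_error_rate:
        return "fail"
    if p99 > criteria.max_p99_latency_ms:
        return "fail"
    return "pass"


# Illustrative thresholds only; pick values that reflect your own service.
criteria = Criteria(max_error_rate=0.01, max_p99_latency_ms=500.0)
print(evaluate([], criteria))  # prints "inconclusive"
```

Returning "inconclusive" for an empty data set reflects the pitfall above: an experiment that cannot be observed cannot be judged a success.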

Frequently Asked Questions

Is this dangerous?
Yes. That is why you start in Staging and plan carefully.
Do we need a tool?
Scripts work fine. Tools like Gremlin make it safer and easier to track.
