From Theory to a Real‑World Test Plan
Why the “Planning” Stage Is a Game‑Changer
Before you fire up an EC2 instance and start killing it, you might ask: “Why all the prep?”
In chaos engineering, the experiment’s design determines whether you learn anything useful. A poorly planned test exposes you to risks like these:
Risk | Consequence |
---|---|
No baseline metrics | You’ll never know if something truly went wrong |
Targeting production by mistake | Real customers suffer downtime |
Missing a guardrail (stop condition) | The experiment runs forever and costs you money |
Planning is the safety net that turns chaos from a destructive hobby into a disciplined, repeatable practice.
1. Identify the Right Deployment
Question | What to Look For |
---|---|
Which environment will you test? | Start in pre‑production or a dedicated test account – never jump straight into production. |
How many resources are involved? | Smaller, isolated experiments reduce risk and make troubleshooting easier. |
Tip: Create a “Chaos Lab” VPC with its own IAM role so that your experiments can only touch the lab resources.
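As a rough sketch of that scoping (the role name, policy name, tag key, and tag value below are all placeholders, not anything FIS requires), you can limit the experiment role’s EC2 permissions to resources that carry a lab tag:

```bash
# Hypothetical policy: the experiment role may only stop/start/terminate
# EC2 instances tagged Environment=chaos-lab (tag key/value are assumptions)
cat > chaos-lab-scope.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "chaos-lab" }
      }
    }
  ]
}
EOF

# Attach the inline policy to the (hypothetical) role that FIS will assume
aws iam put-role-policy \
  --role-name ChaosLabFisRole \
  --policy-name chaos-lab-scope \
  --policy-document file://chaos-lab-scope.json
```

The same role also needs a trust policy that allows `fis.amazonaws.com` to assume it.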
2. Review the Architecture
- Map out all components (load balancers, ECS services, Lambda functions, RDS instances).
- Identify dependencies – e.g., an EC2 instance may talk to a Secrets Manager secret; if you kill that instance, will the entire stack fail?
- Check for recovery procedures: Are there auto‑scaling groups? Do you have health checks in place?
If you’re unsure, consult the AWS Well‑Architected Framework – it’s a quick way to spot missing resiliency controls.
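If you want a quick, scriptable answer to the recovery questions above, a single CLI call shows whether an Auto Scaling group uses load‑balancer health checks and has capacity to absorb a lost instance (the group name here is a placeholder):

```bash
# Placeholder group name; HealthCheckType "ELB" means unhealthy instances are
# replaced based on load balancer health checks, not just EC2 status checks
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].{HealthCheck:HealthCheckType,Min:MinSize,Desired:DesiredCapacity,Max:MaxSize}'
```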
3. Define “Steady State”
Metric | Example |
---|---|
Latency (ms) | < 200 ms for 95% of requests |
CPU Utilization | < 50% during normal load |
Auth failure rate | ≤ 1 failure per 10,000 logins
These numbers become your baseline – the “normal” you want to compare against after injecting faults.
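One way to capture those baselines is to pull the recent history straight from CloudWatch before you inject anything (the ALB dimension value below is a placeholder; substitute whichever namespace and metric your service actually emits):

```bash
# p95 response time for the past hour in 1-minute buckets
# (load balancer ID is a placeholder; ALB reports this metric in seconds)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --extended-statistics p95
```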
4. Form a Testable Hypothesis
Template:
If *[fault action]* is performed, *[metric]* should not exceed *[threshold]*.
Example hypothesis for an auth service
If network latency increases by 10 ms, sign‑in failures will stay below 0.5%.
Why? Because well‑designed retry logic should absorb that extra delay.
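If your service publishes custom auth metrics (the `MyApp/Auth` namespace and metric names below are assumptions, not built‑in AWS metrics), CloudWatch metric math can turn raw counts into the failure percentage the hypothesis refers to:

```bash
# Failure rate as a percentage: 100 * failures / attempts, per minute
cat > failure-rate-query.json <<'EOF'
[
  { "Id": "rate", "Expression": "100 * fails / attempts", "Label": "SignInFailurePercent" },
  { "Id": "fails", "MetricStat": { "Metric": { "Namespace": "MyApp/Auth", "MetricName": "SignInFailures" }, "Period": 60, "Stat": "Sum" }, "ReturnData": false },
  { "Id": "attempts", "MetricStat": { "Metric": { "Namespace": "MyApp/Auth", "MetricName": "SignInAttempts" }, "Period": 60, "Stat": "Sum" }, "ReturnData": false }
]
EOF

aws cloudwatch get-metric-data \
  --metric-data-queries file://failure-rate-query.json \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```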
🔧 5. Choose the Fault Action
Service | Typical Actions |
---|---|
EC2 | stop, terminate
ECS | stop task, drain container instances
RDS | reboot instance, force cluster failover
API Gateway | increase throttling |
Pick an action that mimics a real failure your team has seen in the past or one that is critical to your architecture.
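As a rough sketch of where these choices end up (the tag, role ARN, and alarm ARN are placeholders, and the next post walks through building this properly in the console), here is what a stop‑instances experiment template looks like when created from the CLI:

```bash
# Minimal template: stop one tagged lab instance, restart it after 10 minutes,
# and abort automatically if the guardrail alarm (see step 6) fires
cat > stop-instance-template.json <<'EOF'
{
  "description": "Stop one chaos-lab instance for 10 minutes",
  "targets": {
    "lab-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "chaos-lab" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stop-instance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT10M" },
      "targets": { "Instances": "lab-instances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "<GUARDRAIL_ALARM_ARN>" }
  ],
  "roleArn": "<EXPERIMENT_ROLE_ARN>"
}
EOF

aws fis create-experiment-template --cli-input-json file://stop-instance-template.json
```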
6. Set Stop Conditions
Create CloudWatch alarms that automatically halt the experiment if something goes wrong:
- CPU > 90% for > 5 min
- Error rate > 2% of requests
- Network latency > 500 ms
These guardrails protect your production workloads and keep costs under control.
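As a sketch of the first guardrail above (the alarm name and instance ID are placeholders), the alarm’s ARN is what you later reference in the experiment template’s stop conditions:

```bash
# Guardrail: average CPU above 90% for five consecutive 1-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name chaos-guardrail-cpu \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold
```

When any alarm listed in a template’s stop conditions enters the ALARM state, FIS halts the experiment.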
7. Prepare Monitoring & Alerting
Tool | What to Monitor |
---|---|
CloudWatch Metrics | Latency, error rates, CPU, memory |
X-Ray | Distributed trace details during the fault |
SNS / EventBridge | Notify the team if a stop condition triggers |
Without real‑time visibility you’ll have no way of telling whether your experiment succeeded or failed.
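For the notification row, one simple wiring (topic name and address are placeholders) is an SNS topic with an email subscription, attached to the guardrail alarm:

```bash
# Create the topic and subscribe an on-call address (both are placeholders)
TOPIC_ARN=$(aws sns create-topic --name chaos-alerts --query TopicArn --output text)
aws sns subscribe \
  --topic-arn "$TOPIC_ARN" \
  --protocol email \
  --notification-endpoint oncall@example.com

# Re-run the put-metric-alarm command from step 6 with one extra flag so the
# guardrail also pages the team when it fires:
#   --alarm-actions "$TOPIC_ARN"
```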
8. Run a Dry‑Run (Optional but Recommended)
FIS executes real actions as soon as you start an experiment, so treat your first run as the dry‑run: point the template at a single, disposable target in your Chaos Lab before aiming it at anything that matters.

```bash
aws fis start-experiment \
  --experiment-template-id <TEMPLATE_ID>
```

If that lab run surfaces any issues (e.g., missing permissions on the experiment role), fix them before the real run.
✅ Checklist Before You Hit “Start”
✔ | Item |
---|---|
✅ | Target environment is non‑production
✅ | Architecture diagram reviewed
✅ | Steady‑state metrics defined
✅ | Hypothesis written in testable form
✅ | Fault action chosen and scoped
✅ | Stop conditions set up
✅ | Monitoring dashboards ready
✅ | Dry‑run succeeded (if used)
Next Step
Now that you’ve got a solid plan, the next part of our mini‑series will walk through building your first experiment template in the console – from defining actions to adding stop conditions and finally launching it.