From Theory to a Real‑World Test Plan
Why the “Planning” Stage Is a Game‑Changer
Before you fire up an EC2 instance and start killing it, you might ask: “Why all the prep?”
In chaos engineering, the experiment’s design determines whether you learn anything useful. A poorly planned test exposes you to risks like these:
Risk | Consequence |
---|---|
No baseline metrics | You’ll never know if something truly went wrong |
Targeting production by mistake | Real customers suffer downtime |
Missing a guardrail (stop condition) | The experiment runs forever and costs you money |
Planning is the safety net that turns chaos from a destructive hobby into a disciplined, repeatable practice.
1. Identify the Right Deployment
Question | What to Look For |
---|---|
Which environment will you test? | Start in pre‑production or a dedicated test account – never jump straight into production. |
How many resources are involved? | Smaller, isolated experiments reduce risk and make troubleshooting easier. |
Tip: Create a “Chaos Lab” VPC with its own IAM role so that your experiments can only touch the lab resources.
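As a rough sketch of that scoping (the role name, policy name, tag key, and tag value below are all placeholders, not anything FIS requires), you can limit the experiment role’s EC2 permissions to resources that carry a lab tag:

```bash
# Hypothetical policy: the experiment role may only stop/start/terminate
# EC2 instances tagged Environment=chaos-lab (tag key/value are assumptions)
cat > chaos-lab-scope.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "chaos-lab" }
      }
    }
  ]
}
EOF

# Attach the inline policy to the (hypothetical) role that FIS will assume
aws iam put-role-policy \
  --role-name ChaosLabFisRole \
  --policy-name chaos-lab-scope \
  --policy-document file://chaos-lab-scope.json
```

The same role also needs a trust policy that allows `fis.amazonaws.com` to assume it.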
2. Review the Architecture
- Map out all components (load balancers, ECS services, Lambda functions, RDS instances).
- Identify dependencies – e.g., an EC2 instance may talk to a Secrets Manager secret; if you kill that instance, will the entire stack fail?
- Check for recovery procedures: Are there auto‑scaling groups? Do you have health checks in place?
If you’re unsure, consult the AWS Well‑Architected Framework – it’s a quick way to spot missing resiliency controls.
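If you want a quick, scriptable answer to the recovery questions above, a single CLI call shows whether an Auto Scaling group uses load‑balancer health checks and has capacity to absorb a lost instance (the group name here is a placeholder):

```bash
# Placeholder group name; HealthCheckType "ELB" means unhealthy instances are
# replaced based on load balancer health checks, not just EC2 status checks
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].{HealthCheck:HealthCheckType,Min:MinSize,Desired:DesiredCapacity,Max:MaxSize}'
```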
3. Define “Steady State”
Metric | Example |
---|---|
Latency (ms) | < 200 ms for 95% of requests |
CPU Utilization | < 50% during normal load |
Auth failure rate | ≤ 1 failure per 10,000 logins
These numbers become your baseline – the “normal” you want to compare against after injecting faults.
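One way to capture those baselines is to pull the recent history straight from CloudWatch before you inject anything (the ALB dimension value below is a placeholder; substitute whichever namespace and metric your service actually emits):

```bash
# p95 response time for the past hour in 1-minute buckets
# (load balancer ID is a placeholder; ALB reports this metric in seconds)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --extended-statistics p95
```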
4. Form a Testable Hypothesis
Template:
If *[fault action]* is performed, *[metric]* should not exceed *[threshold]*.
Example hypothesis for an auth service
If network latency increases by 10 ms, sign‑in failures will stay below 0.5%.
Why? Because well‑designed retry logic should absorb that extra delay.
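If your service publishes custom auth metrics (the `MyApp/Auth` namespace and metric names below are assumptions, not built‑in AWS metrics), CloudWatch metric math can turn raw counts into the failure percentage the hypothesis refers to:

```bash
# Failure rate as a percentage: 100 * failures / attempts, per minute
cat > failure-rate-query.json <<'EOF'
[
  { "Id": "rate", "Expression": "100 * fails / attempts", "Label": "SignInFailurePercent" },
  { "Id": "fails", "MetricStat": { "Metric": { "Namespace": "MyApp/Auth", "MetricName": "SignInFailures" }, "Period": 60, "Stat": "Sum" }, "ReturnData": false },
  { "Id": "attempts", "MetricStat": { "Metric": { "Namespace": "MyApp/Auth", "MetricName": "SignInAttempts" }, "Period": 60, "Stat": "Sum" }, "ReturnData": false }
]
EOF

aws cloudwatch get-metric-data \
  --metric-data-queries file://failure-rate-query.json \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```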
🔧 5. Choose the Fault Action
Service | Typical Actions |
---|---|
EC2 | stop, terminate
ECS | stop task, drain container instances
RDS | reboot instance, force cluster failover
API Gateway | increase throttling |
Pick an action that mimics a real failure your team has seen in the past or one that is critical to your architecture.
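As a rough sketch of where these choices end up (the tag, role ARN, and alarm ARN are placeholders, and the next post walks through building this properly in the console), here is what a stop‑instances experiment template looks like when created from the CLI:

```bash
# Minimal template: stop one tagged lab instance, restart it after 10 minutes,
# and abort automatically if the guardrail alarm (see step 6) fires
cat > stop-instance-template.json <<'EOF'
{
  "description": "Stop one chaos-lab instance for 10 minutes",
  "targets": {
    "lab-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "chaos-lab" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stop-instance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT10M" },
      "targets": { "Instances": "lab-instances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "<GUARDRAIL_ALARM_ARN>" }
  ],
  "roleArn": "<EXPERIMENT_ROLE_ARN>"
}
EOF

aws fis create-experiment-template --cli-input-json file://stop-instance-template.json
```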
6. Set Stop Conditions
Create CloudWatch alarms that automatically halt the experiment if something goes wrong:
- CPU > 90% for > 5 min
- Error rate > 2% of requests
- Network latency > 500 ms
These guardrails protect your production workloads and keep costs under control.
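As a sketch of the first guardrail above (the alarm name and instance ID are placeholders), the alarm’s ARN is what you later reference in the experiment template’s stop conditions:

```bash
# Guardrail: average CPU above 90% for five consecutive 1-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name chaos-guardrail-cpu \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold
```

When any alarm listed in a template’s stop conditions enters the ALARM state, FIS halts the experiment.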
7. Prepare Monitoring & Alerting
Tool | What to Monitor |
---|---|
CloudWatch Metrics | Latency, error rates, CPU, memory |
X-Ray | Distributed trace details during the fault |
SNS / EventBridge | Notify the team if a stop condition triggers |
Without real‑time visibility you’ll have no way of telling whether your experiment succeeded or failed.
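For the notification row, one simple wiring (topic name and address are placeholders) is an SNS topic with an email subscription, attached to the guardrail alarm:

```bash
# Create the topic and subscribe an on-call address (both are placeholders)
TOPIC_ARN=$(aws sns create-topic --name chaos-alerts --query TopicArn --output text)
aws sns subscribe \
  --topic-arn "$TOPIC_ARN" \
  --protocol email \
  --notification-endpoint oncall@example.com

# Re-run the put-metric-alarm command from step 6 with one extra flag so the
# guardrail also pages the team when it fires:
#   --alarm-actions "$TOPIC_ARN"
```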
8. Run a Dry‑Run (Optional but Recommended)
FIS executes real actions as soon as you start an experiment, so treat your first run as the dry‑run: point the template at a single, disposable target in your Chaos Lab before aiming it at anything that matters.

```bash
aws fis start-experiment \
  --experiment-template-id <TEMPLATE_ID>
```

If that lab run surfaces any issues (e.g., missing permissions on the experiment role), fix them before the real run.
✅ Checklist Before You Hit “Start”
✔ | Item |
---|---|
✅ | Target environment is non‑production
✅ | Architecture diagram reviewed
✅ | Steady‑state metrics defined
✅ | Hypothesis written in testable form
✅ | Fault action chosen and scoped
✅ | Stop conditions set up
✅ | Monitoring dashboards ready
✅ | Dry‑run succeeded (if used)
Next Step
Now that you’ve got a solid plan, the next part of our mini‑series will walk through building your first experiment template in the console – from defining actions to adding stop conditions and finally launching it.