Failure Simulator
The Failure Simulator lets you answer one of the most important questions in infrastructure management: "What happens if this resource fails?" By selecting any resource in your environment and simulating its failure, you can see exactly which downstream services would be affected, how deep the cascade goes, and what percentage of your infrastructure is at risk.
This is a read-only analysis tool. No changes are made to your actual infrastructure -- the Failure Simulator runs entirely against Guardian Pro's dependency graph.
Why Simulate Failures
Understanding cascading failures before they happen is critical for building resilient infrastructure. The Failure Simulator helps you:
- Identify hidden dependencies -- Discover failure paths you did not know existed
- Validate redundancy -- Confirm that your high-availability configurations actually prevent cascading failures
- Prioritise resilience investments -- Focus engineering effort on the resources with the highest blast radius
- Prepare incident response plans -- Know in advance which services will be affected when a specific resource fails
- Support architecture reviews -- Provide concrete evidence of resilience gaps during design reviews
Run failure simulations on your most critical resources regularly, especially after infrastructure changes. A new dependency added by a deployment could create a cascade path that did not exist before.
Running a Simulation
Step 1: Select a Resource
From the Failure Simulator tab in the Architecture Advisor, choose the resource whose failure you want to simulate. You can search by:
- Resource name -- The friendly name or identifier
- Resource type -- Filter by AWS service type (EC2 instance, RDS database, Lambda function, etc.)
- Region -- Filter by AWS region
Step 2: Run the Simulation
Click Simulate Failure to begin the analysis. Guardian Pro traverses the dependency graph starting from the selected resource, following all downstream dependency paths to determine which resources would be affected.
The simulation typically completes within seconds, even for large environments.
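The traversal described in this step can be pictured as a breadth-first walk over a graph of "who depends on whom". The following is a minimal sketch of that idea, not Guardian Pro's actual implementation: the `DEPENDENTS` mapping, the resource names, and the `simulate_failure` function are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical dependency graph: each key maps to the resources that
# depend on it directly (its downstream dependents).
DEPENDENTS = {
    "rds-primary": ["orders-api", "billing-api"],
    "orders-api": ["web-frontend"],
    "billing-api": ["web-frontend", "invoice-lambda"],
    "web-frontend": [],
    "invoice-lambda": [],
    "nat-gateway": ["invoice-lambda"],
}


def simulate_failure(source, dependents):
    """Return {resource: cascade level} for every resource downstream of source.

    Level 1 means the resource depends on the source directly; deeper levels
    are reached through intermediate resources. The source itself is excluded.
    """
    levels = {}
    seen = {source}  # guards against cycles and repeated visits
    queue = deque([(source, 0)])
    while queue:
        resource, level = queue.popleft()
        for child in dependents.get(resource, []):
            if child not in seen:
                seen.add(child)
                levels[child] = level + 1
                queue.append((child, level + 1))
    return levels


affected = simulate_failure("rds-primary", DEPENDENTS)
```

Because breadth-first search visits resources in order of distance, each resource is assigned the shallowest level at which the failure reaches it.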
Step 3: Review Results
The simulation results show:
| Metric | Description |
|---|---|
| Affected resources | The total number of resources that would be impacted by the failure |
| Cascade depth | How many levels deep the failure propagates through the dependency chain |
| Impact percentage | What percentage of your total infrastructure would be affected |
| Affected services | Which AWS service types are in the impact path |
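To make the four metrics concrete, the sketch below derives them from a hypothetical traversal result. It assumes that the source resource is not counted among the affected resources and that impact percentage is affected resources divided by total resources; the product may calculate these differently. The `summarise` function and all sample data are illustrative only.

```python
def summarise(levels, resource_types, total_resources):
    """Derive summary metrics from a {resource: cascade level} mapping.

    Assumption: the failed source resource is not included in `levels`.
    """
    affected = len(levels)
    return {
        "affected_resources": affected,
        "cascade_depth": max(levels.values(), default=0),
        "impact_percentage": (
            100.0 * affected / total_resources if total_resources else 0.0
        ),
        "affected_services": sorted({resource_types[r] for r in levels}),
    }


# Hypothetical simulation output and resource inventory.
levels = {"orders-api": 1, "billing-api": 1, "web-frontend": 2, "invoice-lambda": 2}
resource_types = {
    "orders-api": "ECS service",
    "billing-api": "ECS service",
    "web-frontend": "CloudFront distribution",
    "invoice-lambda": "Lambda function",
}

summary = summarise(levels, resource_types, total_resources=8)
```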
Understanding the Cascade Visualisation
The simulation results include an interactive graph that shows the failure cascade path:
- The source resource is highlighted as the origin of the failure
- Directly dependent resources are shown at the first level of the cascade
- Indirectly affected resources appear at deeper levels, connected through the dependency chain
- Failure paths are highlighted to show exactly how the failure propagates from one resource to the next
The visualisation makes it easy to trace the specific path through which a failure in one resource would eventually affect a distant, seemingly unrelated service.
The cascade visualisation uses the same dependency graph that powers the Architecture Map. Both views draw from the same underlying data -- the Architecture Map shows your full topology, while the Failure Simulator highlights a specific failure path within it.
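Tracing how a failure reaches a distant resource amounts to finding a path through the dependency graph. The sketch below shows one way to do this with breadth-first search and parent pointers. It is an illustration under assumed data, not the product's algorithm, and the graph, resource names, and `failure_path` function are hypothetical.

```python
from collections import deque


def failure_path(source, target, dependents):
    """Return the shortest failure path from source to target, or None."""
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        resource = queue.popleft()
        if resource == target:
            # Walk the parent pointers back to the source, then reverse.
            path = []
            while resource is not None:
                path.append(resource)
                resource = parents[resource]
            return path[::-1]
        for child in dependents.get(resource, []):
            if child not in parents:
                parents[child] = resource
                queue.append(child)
    return None


# Hypothetical graph: each key maps to its direct dependents.
graph = {
    "nat-gateway": ["batch-worker"],
    "batch-worker": ["report-queue"],
    "report-queue": ["dashboard-api"],
    "dashboard-api": [],
}

path = failure_path("nat-gateway", "dashboard-api", graph)
```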
Interpreting Impact Percentage
The impact percentage is one of the most important metrics in the simulation results. It tells you what share of your total monitored infrastructure, expressed as a percentage, would be affected by the simulated failure.
| Impact Level | Interpretation | Action |
|---|---|---|
| Below 5% | Localised impact. The failure would affect a small, contained set of resources. | Monitor but low urgency |
| 5% - 20% | Moderate impact. A noticeable portion of your infrastructure is in the blast radius. | Review redundancy for the source resource |
| 20% - 50% | Significant impact. A large segment of your environment would be disrupted. | Prioritise resilience improvements |
| Above 50% | Critical impact. The majority of your infrastructure depends on this resource. | Address immediately with redundancy and isolation |
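If you export simulation results and want to apply these bands in your own tooling, a small helper such as the one below could work. The table does not say which band the boundary values 5%, 20%, and 50% belong to, so this sketch assumes that 5% is Moderate and that both 20% and 50% are Significant. The `impact_level` function is hypothetical and not part of Guardian Pro.

```python
def impact_level(percentage):
    """Map an impact percentage to the level names used in the table above.

    Boundary handling is an assumption: 5 -> Moderate, 20 -> Significant,
    50 -> Significant (only values above 50 count as Critical).
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percentage < 5:
        return "Localised"
    if percentage < 20:
        return "Moderate"
    if percentage <= 50:
        return "Significant"
    return "Critical"


levels = [impact_level(p) for p in (2.5, 12, 35, 80)]
```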
Common Simulation Scenarios
Database Failure
Simulating the failure of a primary database instance reveals which application tiers, APIs, and downstream services depend on it. This is especially useful for validating that Multi-AZ failover is properly configured and that application connections would fail over gracefully.
NAT Gateway Failure
NAT gateways are often shared by multiple subnets and services. Simulating their failure shows which resources would lose outbound internet connectivity, helping you decide whether to deploy redundant NAT gateways.
Load Balancer Failure
Load balancers are critical shared infrastructure. A failure simulation reveals the full set of backend services that would become unreachable, helping you assess whether redundant load balancing is needed.
VPC or Subnet Failure
Simulating the failure of networking components typically shows the broadest blast radius, since many resources depend on the underlying network layer. This is useful for disaster recovery planning.
If a simulation shows that a single resource failure would cascade to affect a large percentage of your infrastructure, this is a strong signal that your architecture needs isolation boundaries or redundancy improvements. Consider using the Risk Radar to view related risks.
Using Simulations for Architecture Decisions
Simulation results can inform architecture decisions at several stages:
Before Deploying New Infrastructure
Run simulations on the resources your new service will depend on. If a critical dependency has a high blast radius, consider adding redundancy before deploying.
During Incident Post-Mortems
After an outage, simulate the failure that occurred and compare the actual impact against what Guardian Pro predicted. A close match validates your dependency mapping, while resources that failed in practice but were absent from the simulation point to gaps in it.
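The comparison itself is a set difference between two lists of resources. The sketch below assumes you have both lists to hand, for example from an exported simulation result and your incident timeline. The `compare_impact` function and the resource names are hypothetical.

```python
def compare_impact(predicted, actual):
    """Compare simulated impact with the impact observed during an incident."""
    predicted, actual = set(predicted), set(actual)
    return {
        # Affected in both the simulation and the real outage.
        "confirmed": sorted(predicted & actual),
        # Failed in reality but missing from the simulation: likely mapping gaps.
        "missed": sorted(actual - predicted),
        # Predicted but unaffected: possibly protected by redundancy.
        "overpredicted": sorted(predicted - actual),
    }


# Hypothetical post-mortem data.
report = compare_impact(
    predicted=["orders-api", "billing-api", "web-frontend"],
    actual=["orders-api", "web-frontend", "partner-webhook"],
)
```

Entries under `missed` are the most useful starting point, since they may be application-level or external dependencies that the dependency graph does not capture.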
For Compliance and Audit
Many compliance frameworks and architecture review processes (such as a review against the Reliability pillar of the AWS Well-Architected Framework) expect evidence that failure scenarios have been evaluated. Failure simulation results can serve as documentation of your resilience analysis.
When Planning Cost Optimisation
Before removing resources flagged as underutilised, simulate their failure to confirm that they are not part of any critical dependency path. This prevents accidental service disruptions during cost optimisation.
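The same check can be expressed as a simple rule: a resource is a safe removal candidate only if no critical resource sits downstream of it. The following sketch illustrates that rule under assumed data. The graph, the set of critical resources, and the `blocking_dependents` function are hypothetical.

```python
def blocking_dependents(candidate, dependents, critical):
    """Return the critical resources that sit downstream of candidate."""
    seen = set()
    stack = [candidate]
    while stack:
        resource = stack.pop()
        for child in dependents.get(resource, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(seen & set(critical))


# Hypothetical graph: each key maps to its direct dependents.
graph = {
    "old-cache": ["report-worker"],
    "report-worker": ["finance-dashboard"],
    "finance-dashboard": [],
    "idle-ec2": [],
}
critical = {"finance-dashboard"}

blocked = blocking_dependents("old-cache", graph, critical)
safe = blocking_dependents("idle-ec2", graph, critical)
```

An empty result suggests that the resource is not on a critical path in the mapped graph. The limitations described in the next section still apply.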
Simulation Limitations
The Failure Simulator analyses dependencies based on Guardian Pro's resource discovery and relationship mapping. Keep the following in mind:
- Application-level dependencies -- The simulator analyses infrastructure-level dependencies (network, IAM, storage). Application-level dependencies (such as a microservice calling another microservice's API) are detected where AWS service integrations exist, but custom application routing may not be fully mapped.
- External dependencies -- Dependencies on services outside your AWS environment (third-party APIs, on-premises systems) are not included in the simulation.
- Point-in-time analysis -- Simulations run against the most recent scan data. If your infrastructure has changed since the last scan, run a new scan before simulating.
For the most accurate simulations, ensure your resource discovery scan is up to date. You can trigger a scan from the Dashboard before running simulations.
Next Steps
- Architecture Map -- Visualise your full infrastructure topology and the dependencies that the Failure Simulator analyses.
- Risk Radar -- View architectural risks that the Failure Simulator can help you investigate further.
- Health Score -- Understand how identified risks affect your overall infrastructure health.