Failure Simulator

The Failure Simulator answers a central question in infrastructure management: "What happens if this resource fails?" Select any resource in your environment, simulate its failure, and see which downstream services would be affected, how deep the cascade goes, and what percentage of your infrastructure is at risk.

This is a read-only analysis tool. No changes are made to your actual infrastructure -- the Failure Simulator runs entirely against Guardian Pro's dependency graph.

Why Simulate Failures

Understanding cascade failures before they happen is critical for building resilient infrastructure. The Failure Simulator helps you:

Identify hidden dependencies -- Discover failure paths you did not know existed
Validate redundancy -- Confirm that your high-availability configurations actually prevent cascading failures
Prioritise resilience investments -- Focus engineering effort on the resources with the highest blast radius
Prepare incident response plans -- Know in advance which services will be affected when a specific resource fails
Support architecture reviews -- Provide concrete evidence of resilience gaps during design reviews

tip

Run failure simulations on your most critical resources regularly, especially after infrastructure changes. A new dependency added by a deployment could create a cascade path that did not exist before.

Running a Simulation

Step 1: Select a Resource

From the Failure Simulator tab in the Architecture Advisor, choose the resource you want to simulate a failure for. You can search by:

Resource name -- The friendly name or identifier
Resource type -- Filter by AWS service type (EC2 instance, RDS database, Lambda function, etc.)
Region -- Filter by AWS region

Step 2: Run the Simulation

Click Simulate Failure to begin the analysis. Guardian Pro traverses the dependency graph starting from the selected resource, following all downstream dependency paths to determine which resources would be affected.

The simulation typically completes within seconds, even for large environments.
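
As an illustration of the traversal described above, the following minimal sketch performs a breadth-first search over a map of "is depended on by" edges. It is not Guardian Pro's implementation or API: the `simulate_failure` function, the shape of the `dependents` mapping, and the resource names are all hypothetical.

```python
from collections import deque


def simulate_failure(dependents, source):
    """Return {resource: cascade_level} for every resource downstream of source.

    `dependents` maps a resource to the resources that depend on it directly
    (hypothetical data shape, chosen for this sketch).
    """
    levels = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            # Record each resource once, at its shallowest level, so that
            # shared dependencies and cycles do not cause repeat visits.
            if child not in levels:
                levels[child] = levels[current] + 1
                queue.append(child)
    # Assumption: the failed resource itself is not counted as "affected".
    del levels[source]
    return levels


# Hypothetical dependency data: two APIs use the database, one frontend uses both.
dependents = {
    "rds-primary": ["orders-api", "billing-api"],
    "orders-api": ["web-frontend"],
    "billing-api": ["web-frontend"],
}
affected = simulate_failure(dependents, "rds-primary")
```

In this sketch, `affected` maps each downstream resource to the cascade level at which it is first reached, which is the information the results view summarises.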

Step 3: Review Results

The simulation results show:

Metric	Description
Affected resources	The total number of resources that would be impacted by the failure
Cascade depth	How many levels deep the failure propagates through the dependency chain
Impact percentage	What percentage of your total infrastructure would be affected
Affected services	Which AWS service types are in the impact path
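
To make the relationship between these metrics concrete, the sketch below derives all four from a single traversal result. The `summarise` function and its input format are hypothetical, and whether Guardian Pro counts the source resource among the affected resources is not stated here; this example assumes it does not.

```python
def summarise(affected, total_resources):
    """Derive the four result metrics from a traversal result.

    `affected` maps resource id -> (cascade level, service type).
    `total_resources` is the number of monitored resources.
    Both are hypothetical inputs for this sketch.
    """
    if total_resources <= 0:
        raise ValueError("total_resources must be positive")
    return {
        "affected_resources": len(affected),
        # Deepest level reached by the cascade; 0 if nothing is affected.
        "cascade_depth": max((level for level, _ in affected.values()), default=0),
        "impact_percentage": round(100 * len(affected) / total_resources, 1),
        "affected_services": sorted({service for _, service in affected.values()}),
    }


# Hypothetical result: three affected resources in a 40-resource environment.
summary = summarise(
    {
        "orders-api": (1, "ECS"),
        "billing-api": (1, "Lambda"),
        "web-frontend": (2, "ECS"),
    },
    total_resources=40,
)
```

With this sample data, three of 40 resources are affected, so the impact percentage is 7.5 and the cascade depth is 2.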

Understanding the Cascade Visualisation

The simulation results include an interactive graph that shows the failure cascade path:

The source resource is highlighted as the origin of the failure
Directly dependent resources are shown at the first level of the cascade
Indirectly affected resources appear at deeper levels, connected through the dependency chain
Failure paths are highlighted to show exactly how the failure propagates from one resource to the next

The visualisation makes it easy to trace the specific path through which a failure in one resource would eventually affect a distant, seemingly unrelated service.

info

The cascade visualisation uses the same dependency graph that powers the Architecture Map. Both views draw from the same underlying data -- the Architecture Map shows your full topology, while the Failure Simulator highlights a specific failure path within it.

Interpreting Impact Percentage

The impact percentage tells you what fraction of your total monitored infrastructure would be affected by the simulated failure. Use the bands below to judge how urgently the source resource needs attention.

Impact Level	Interpretation	Action
Below 5%	Localised impact. The failure would affect a small, contained set of resources.	Monitor but low urgency
5% - 20%	Moderate impact. A noticeable portion of your infrastructure is in the blast radius.	Review redundancy for the source resource
20% - 50%	Significant impact. A large segment of your environment would be disrupted.	Prioritise resilience improvements
Above 50%	Critical impact. The majority of your infrastructure depends on this resource.	Address immediately with redundancy and isolation
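
If you want to apply these bands in your own reporting, a minimal sketch could look like the following. The `impact_level` function is hypothetical. The table does not say which band the boundary values 20% and 50% belong to, so this example assumes that exactly 20% counts as significant and exactly 50% counts as significant rather than critical (because the critical band is "above 50%").

```python
def impact_level(percentage):
    """Map an impact percentage to the band names used in the table above."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percentage < 5:
        return "Localised"
    if percentage < 20:
        return "Moderate"
    if percentage <= 50:  # assumption: exactly 50% is not yet "above 50%"
        return "Significant"
    return "Critical"
```

For example, `impact_level(7.5)` falls in the moderate band, which suggests reviewing redundancy for the source resource.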

Common Simulation Scenarios

Database Failure

Simulating the failure of a primary database instance reveals which application tiers, APIs, and downstream services depend on it. This is especially useful for validating that multi-AZ failover is properly configured and that application connections would fail over gracefully.

NAT Gateway Failure

NAT gateways are often shared by multiple subnets and services. Simulating their failure shows which resources would lose outbound internet connectivity, helping you decide whether to deploy redundant NAT gateways.

Load Balancer Failure

Load balancers are critical shared infrastructure. A failure simulation reveals the full set of backend services that would become unreachable, helping you assess whether redundant load balancing is needed.

VPC or Subnet Failure

Simulating the failure of networking components usually shows the broadest blast radius, since many resources depend on the underlying network layer. This is useful for disaster recovery planning.

warning

If a simulation shows that a single resource failure would cascade to affect a large percentage of your infrastructure, this is a strong signal that your architecture needs isolation boundaries or redundancy improvements. Consider using the Risk Radar to view related risks.

Using Simulations for Architecture Decisions

Simulation results can inform architecture decisions at several points in the infrastructure lifecycle:

Before Deploying New Infrastructure

Run simulations on the resources your new service will depend on. If a critical dependency has a high blast radius, consider adding redundancy before deploying.

During Incident Post-Mortems

After an outage, simulate the failure that occurred to compare the actual impact against what Guardian Pro predicted. This validates your dependency mapping and helps you identify gaps.
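
One way to structure that comparison is as a set difference between the resources the simulation predicted and the resources your incident timeline shows were actually affected. The sketch below assumes you have both lists available as resource identifiers; the `compare_impact` function and the sample names are hypothetical and not part of Guardian Pro.

```python
def compare_impact(predicted, actual):
    """Compare simulated impact against the impact observed in an incident."""
    predicted, actual = set(predicted), set(actual)
    return {
        # Predicted and observed: the dependency mapping was correct.
        "confirmed": sorted(predicted & actual),
        # Observed but not predicted: possible unmapped dependencies.
        "missed_by_simulation": sorted(actual - predicted),
        # Predicted but not observed: possibly protected by redundancy,
        # or simply not exercised during the incident.
        "not_observed": sorted(predicted - actual),
    }


# Hypothetical post-mortem data
report = compare_impact(
    predicted=["orders-api", "billing-api", "web-frontend"],
    actual=["orders-api", "web-frontend", "reporting-job"],
)
```

Resources under `missed_by_simulation` are the most useful follow-up: they may point to application-level or external dependencies that the dependency graph does not capture.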

For Compliance and Audit

Many compliance and best-practice frameworks (such as the AWS Well-Architected Framework) expect evidence that failure scenarios have been evaluated. Failure simulation results can serve as documentation of your resilience analysis.

When Planning Cost Optimisation

Before removing resources flagged as underutilised, simulate their failure to confirm that they are not in any critical dependency path. This prevents accidental service disruptions during cost optimisation.

Simulation Limitations

The Failure Simulator analyses dependencies based on Guardian Pro's resource discovery and relationship mapping. Keep the following in mind:

Application-level dependencies -- The simulator analyses infrastructure-level dependencies (network, IAM, storage). Application-level dependencies (such as a microservice calling another microservice's API) are detected where AWS service integrations exist, but custom application routing may not be fully mapped.
External dependencies -- Dependencies on services outside your AWS environment (third-party APIs, on-premises systems) are not included in the simulation.
Point-in-time analysis -- Simulations run against the most recent scan data. If your infrastructure has changed since the last scan, run a new scan before simulating.

tip

For the most accurate simulations, ensure your resource discovery scan is up to date. You can trigger a scan from the Dashboard before running simulations.

Next Steps

Architecture Map -- Visualise your full infrastructure topology and the dependencies that the Failure Simulator analyses.
Risk Radar -- View architectural risks that the Failure Simulator can help you investigate further.
Health Score -- Understand how identified risks affect your overall infrastructure health.
