If you’ve worked in networking long enough, you’ve probably taken down production at least once. I know I have. That’s why I believe in introducing a little controlled chaos—not to be reckless, but to build resilience and make sure we never walk out of a change window with a broken network.
Most of my career has been on the data center side, with years of Cisco and CLI habits that never really go away. I’ve always relied on baselining, validation, and proving intent before and after every change. Now I’m applying that same mindset to AWS networking, where you can’t just SSH into a box and run your favorite show commands, but the need for confidence in how the network will behave is just as critical.
1. Get your bearings with a real network view
Before I touch anything, I need situational awareness. In the data center, that meant topology diagrams, routing tables, and a mental model of how packets actually flow. In the cloud, if you don’t deliberately build that picture, you’re flying blind and hoping abstractions don’t bite you.
I start by using AWS Network Manager to build a dynamic visualization of the environment:
- VPCs, Transit Gateways, VPNs, Direct Connect, TGW Connect
- On-prem sites and devices for true end-to-end context
- Multi-account environments (with the right IAM access)
- Automatic updates as attachments, links, and routes change
This gives me a live, global view I can take into a change review and say, “This is the network as it exists right now.”
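The same inventory behind that view is available programmatically, which makes it easy to capture as a change-review artifact. A minimal sketch with boto3 (the global network, sites, and TGW registrations are whatever you have set up; nothing here is prescriptive):

```python
import boto3

# Network Manager's API is homed in us-west-2, even though the view is global.
nm = boto3.client("networkmanager", region_name="us-west-2")

for gn in nm.describe_global_networks()["GlobalNetworks"]:
    gn_id = gn["GlobalNetworkId"]
    print(f"Global network: {gn_id}")

    # On-prem sites and devices registered for true end-to-end context
    sites = nm.get_sites(GlobalNetworkId=gn_id)["Sites"]
    devices = nm.get_devices(GlobalNetworkId=gn_id)["Devices"]
    print(f"  sites={len(sites)}  devices={len(devices)}")

    # Registered Transit Gateways give the cloud side of the picture
    regs = nm.get_transit_gateway_registrations(GlobalNetworkId=gn_id)
    for reg in regs["TransitGatewayRegistrations"]:
        print(f"  TGW: {reg['TransitGatewayArn']} ({reg['State']['Code']})")
```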
2. Establish guardrails (and know what they can’t do)
Once I understand the topology, I want to understand behavior. What does “healthy” look like when everything is working, and what traffic patterns represent the business actually doing its job? Without that baseline, every outage looks the same: panic and finger-pointing.
Observability tools like VPC Flow Logs and synthetic transactions help by showing steady-state behavior:
- Who’s talking to whom
- Full 5-tuple detail: source, destination, ports, and protocol
- Top talkers and high-value flows
But they also have limits:
- They can be expensive to run continuously at scale
- During an outage, “no traffic” doesn’t tell you why
- Data is scattered across VPCs and accounts
They’re necessary—but they’re not sufficient for change validation.
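Even with those limits, the steady-state picture is worth capturing on demand rather than streamed continuously. A minimal sketch, assuming Flow Logs are delivered to a CloudWatch Logs group in the default format (the log group name below is a placeholder):

```python
import time
import boto3

logs = boto3.client("logs")

# Top accepted flows by volume over the last week: "who's talking to whom."
QUERY = """
fields srcAddr, dstAddr, dstPort, protocol
| filter action = "ACCEPT"
| stats sum(bytes) as totalBytes by srcAddr, dstAddr, dstPort, protocol
| sort totalBytes desc
| limit 20
"""

now = int(time.time())
query_id = logs.start_query(
    logGroupName="/vpc/flow-logs/prod",   # placeholder log group name
    startTime=now - 7 * 24 * 3600,        # one week of steady state
    endTime=now,
    queryString=QUERY,
)["queryId"]

# Logs Insights queries run asynchronously; poll until the query settles.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(2)

for row in result["results"]:
    print({f["field"]: f["value"] for f in row})
```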
3. Shift from “change validation” to “intent validation”
Most change reviews focus on mechanics: what routes are changing, what security groups are updated, what attachments are moving. I want to flip that around and focus on outcomes: what connectivity must still exist when the change is done.
Intent means defining:
- Source (interface / IP)
- Destination
- Ports and protocols
- Expected reachability
What happens in the middle is an implementation detail. What matters is: does the intended connectivity still work?
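Writing intent down as data is what makes it testable. A minimal sketch of the shape I use; the flow names and ENI IDs are purely illustrative placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NetworkIntent:
    name: str               # human-readable flow, e.g. "web tier -> orders DB"
    source: str             # ENI, instance, or gateway ID
    destination: str        # ENI, instance, or gateway ID
    protocol: str           # "tcp" or "udp"
    destination_port: int
    expect_reachable: bool = True   # some intents are "must NOT be reachable"

INTENTS = [
    NetworkIntent("web -> orders DB",    "eni-0aaa11111111aaaa1", "eni-0bbb22222222bbbb2", "tcp", 5432),
    NetworkIntent("app -> payments API", "eni-0ccc33333333cccc3", "eni-0ddd44444444dddd4", "tcp", 443),
    NetworkIntent("web -> bastion SSH",  "eni-0aaa11111111aaaa1", "eni-0eee55555555eeee5", "tcp", 22,
                  expect_reachable=False),
]
```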
4. Run intent checks with Reachability Analyzer
This is where I replace tribal knowledge and late-night bridge calls with something deterministic. I want a tool that can tell me, with precision, whether the network can deliver on the intent I just defined.
Using AWS Reachability Analyzer, I run point-in-time checks that:
- Return a clear pass / fail
- Show the exact path through the network
- Traverse:
  - ENIs
  - Route tables
  - Security groups
  - Network ACLs
  - Transit Gateways and attachments
- Pinpoint where and why a flow is blocked
Each check represents one piece of network intent, validated before and after a change.
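Under the hood this is the EC2 Network Insights API. A minimal sketch with boto3 that runs one intent as a pass/fail check and prints why a blocked flow failed (the resource IDs are placeholders):

```python
import time
import boto3

ec2 = boto3.client("ec2")

def run_intent_check(source, destination, protocol, port):
    """Run one Reachability Analyzer check; return True if the path is reachable."""
    path_id = ec2.create_network_insights_path(
        Source=source,              # ENI, instance, or gateway ID
        Destination=destination,
        Protocol=protocol,          # "tcp" or "udp"
        DestinationPort=port,
    )["NetworkInsightsPath"]["NetworkInsightsPathId"]

    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path_id
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # Analyses run asynchronously; poll until this one finishes.
    while True:
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            break
        time.sleep(5)

    reachable = analysis.get("NetworkPathFound", False)
    if not reachable:
        # Explanations name the component (SG, NACL, route table, TGW, ...)
        # that blocked the flow and why.
        for exp in analysis.get("Explanations", []):
            print("  blocked:", exp.get("ExplanationCode"))
    return reachable

# Example: prove the web tier can still reach the database on 5432.
print(run_intent_check("eni-0aaa11111111aaaa1", "eni-0bbb22222222bbbb2", "tcp", 5432))
```

A path can be analyzed repeatedly, so pre- and post-change runs compare exactly the same definition of the flow.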
5. Introduce controlled chaos (safely)
Resilience isn’t proven by success—it’s proven by how you detect and recover from failure. That’s why I deliberately break things in a controlled environment, the same way we run disaster-recovery tests instead of assuming backups will work.
Examples of realistic “chaos” I test:
- Misconfigured routes or black-hole routes
- Broken TGW or VPC attachments
- Security group or NACL blocks
- ALB target misconfigurations
- Firewall policy errors
The workflow is simple:
- Run pre-change intent checks
- Make the change (or simulated failure)
- Run post-change intent checks
- If anything fails, immediately see:
  - Which flow is broken
  - Where in the path it failed
Now I’m not blind, and I’m not guessing.
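Tied together, the loop is small. The sketch below reuses the hypothetical run_intent_check() and INTENTS from the earlier examples:

```python
def check_all(intents):
    # An intent passes when observed reachability matches what was declared.
    return {
        i.name: run_intent_check(i.source, i.destination,
                                 i.protocol, i.destination_port) == i.expect_reachable
        for i in intents
    }

baseline = check_all(INTENTS)                    # pre-change intent checks

input("Apply the change or inject the failure, then press Enter...")

after = check_all(INTENTS)                       # post-change intent checks

for name, passed_before in baseline.items():
    if passed_before and not after[name]:
        print(f"BROKEN by this change: {name}")  # which flow, caught immediately
```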
6. Automate it in a pipeline
Once the process works manually, the next step is to remove humans as the weak link. If validation depends on someone remembering to run a check at 2 a.m., it will eventually be skipped.
In a CI/CD or IaC pipeline (Terraform, CloudFormation, etc.):
- Run reachability checks (baseline)
- Apply the network change
- Run the same checks again
- If any check fails:
  - Pause for human review, or
  - Automatically roll back
This enforces the rule: never exit a change window with a broken network.
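In practice that rule is just an exit code. A sketch of the post-apply gate step, again reusing the hypothetical check_all() and INTENTS from above; the surrounding pipeline stages (apply, hold-for-review, rollback) are assumed, not shown:

```python
import sys

results = check_all(INTENTS)
failed = [name for name, ok in results.items() if not ok]

if failed:
    print("Intent checks failed after the change:")
    for name in failed:
        print(f"  - {name}")
    sys.exit(1)    # non-zero exit: pipeline pauses for review or rolls back

print(f"All {len(results)} intent checks passed; safe to close the change window.")
```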
7. Scale intent checks using real traffic
In large environments, the hard part isn’t running checks—it’s knowing which checks matter. You can’t protect what you haven’t identified as critical.
Use observability sources to identify:
- Top talkers from Flow Logs
- High-hit firewall rules
- Critical synthetic transactions
Then turn those into Reachability Analyzer intent tests:
- Source → Destination
- Port / Protocol
- Expected path
Over time, you build a living library of critical flows that can be validated automatically on every change.
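A minimal sketch of that conversion: take the srcAddr/dstAddr/dstPort rows from the earlier Flow Logs query, resolve each private IP to its ENI, and emit the same hypothetical NetworkIntent records used above. This is illustrative plumbing, not a prescribed pipeline:

```python
import boto3

ec2 = boto3.client("ec2")

def eni_for_ip(private_ip):
    """Resolve a private IP seen in Flow Logs to the ENI that owns it (None if gone)."""
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "addresses.private-ip-address", "Values": [private_ip]}]
    )["NetworkInterfaces"]
    return enis[0]["NetworkInterfaceId"] if enis else None

def intents_from_flows(flows):
    """Turn observed top-talker rows into NetworkIntent records for Reachability Analyzer."""
    intents = []
    for f in flows:  # e.g. {"srcAddr": "10.0.1.12", "dstAddr": "10.0.2.30", "dstPort": "5432"}
        src, dst = eni_for_ip(f["srcAddr"]), eni_for_ip(f["dstAddr"])
        if src and dst:
            intents.append(NetworkIntent(
                name=f'{f["srcAddr"]} -> {f["dstAddr"]}:{f["dstPort"]}',
                source=src,
                destination=dst,
                protocol="tcp",                  # assume TCP for the sketch
                destination_port=int(f["dstPort"]),
            ))
    return intents
```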
Final takeaway: Schedule a “Chaos Hour”
Just like you rehearse incident response and disaster recovery, you should rehearse network failure and recovery. Make it routine, not heroic.
Treat this like a DR exercise:
- Run it monthly or quarterly
- One person introduces a failure
- Another person proves the break and fixes it using intent checks
- Start in a sandbox or digital twin, not production
I’ve shared a GitHub repo and runbook (via QR code in the session) with:
- Sample environments
- Randomized break scenarios
- Terraform-based reachability pipelines
The goal isn’t to break your network for fun. It’s to make sure that when something does break, you already know exactly how to prove it, find it, and fix it.


