When the Cloud Cracks: Lessons from the AWS Outage

21 October 2025 - 5 Minute Read

Yesterday’s major AWS outage was another wake-up call: scale doesn’t always mean stability, and outsourcing doesn’t erase risk. As more organisations push their critical systems into public cloud platforms, we’re discovering that consolidation can amplify exposure rather than reduce it.

The Outage in Brief

On 20 October 2025, the AWS US-East-1 region (Northern Virginia) began experiencing widespread service failures from around 07:11 UTC (03:11 EDT). AWS later confirmed that an internal fault within the Network Load Balancer monitoring subsystem triggered DNS and control-plane instability. In plain terms, services couldn’t reliably find or talk to each other.

The impact was enormous: millions of users across thousands of applications were affected. Major consumer and enterprise platforms such as Snapchat, Reddit, Roblox, Fortnite, Venmo, Coinbase, Airtable, Canva and Zapier suffered downtime. Even HMRC and several UK banks, including Lloyds and Halifax, reported disruption.
Amazon’s own ecosystem wasn’t immune either - Alexa, Ring, and parts of its retail operation all stumbled before recovery later in the day.


What Actually Failed and Why It Rippled So Far

The technical failure originated in AWS’s internal control systems rather than customer workloads. Because so much of the modern internet depends on US-East-1 for authentication, logging, and orchestration, a regional hiccup turned into a global event.
This isn’t new; we saw similar patterns in 2021, 2022 and 2023. The difference now is how much more business-critical activity sits in the cloud. The outage demonstrated how easily a regional fault can cascade across continents and industries.

Cloud Concentration: When Big Fish Enter the Ocean

The cloud promised resilience through scale. Yet the more we centralise workloads with a handful of hyperscalers, the more correlated our risks become. When failure strikes, your company, no matter how large, becomes just another tenant in a vast, noisy ocean.

During an incident, communications prioritise the many, not the few. Updates flow through dashboards and status pages rather than direct human contact. If your business and board are comfortable with radio silence and waiting for the next update, hyperscale might be fine.
But if you need real-time context, clear accountability, and a voice at the end of the phone to explain what’s happening, you may prefer to be the big fish in a smaller pond.

Cloud Isn’t a Panacea; Failure Is Inevitable

Moving to the cloud never eliminated failure; it just changed its shape. Outages are as certain as cyber-attacks - not if, but when. What matters is how your architecture, your supplier mix, and your operating model respond when the lights flicker.

Specialist Infrastructure Needs Specialist Partners

For organisations running IBM Power, AIX, or mainframe workloads, the challenge is even sharper. These systems underpin core banking, retail, and manufacturing operations where downtime isn’t just inconvenient - it’s existential.
Not every cloud provider understands the nuances of these platforms or offers the same performance and recovery characteristics. Selecting (and regularly reassessing) your provider is critical:

  • Ensure they truly support IBM Power and Z environments rather than emulating them.
  • Confirm you have direct, skilled escalation routes, not just ticket portals.
  • Validate the SLA response for infrastructure-level incidents, not just VM uptime.
  • Revisit contracts annually, especially after M&A or organisational changes, to confirm that promised service levels still hold true.

Even with smaller providers, as their business grows or ownership changes, culture and responsiveness can drift. Regular independent reviews and audits are the only way to stay ahead of that curve.

Reducing Your Exposure: A Practical Playbook

  1. Separate your failure domains
    Multi-AZ deployment is essential; multi-region architecture for critical systems should be your next step.
  2. Design for control-plane and DNS failures
    Use regional endpoints where possible, and build in circuit-breakers, retries, and graceful degradation (see the sketch after this list).
  3. Abstract state and identity
    Replicate data stores and cache credentials locally so you can operate through temporary outages.
  4. Diversify where it matters
    Blend hyperscale and specialist or sovereign clouds to suit different risk profiles. Be intentional about where critical workloads live.
  5. Plan the communication chain
    Create runbooks not only for technical recovery but also for executive and customer communication when your provider goes dark.
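
To make steps 2 and 3 a little more concrete, here is a minimal Python/boto3 sketch of the pattern: pin a client to a specific regional endpoint, cap SDK-level retries and timeouts so a struggling control plane can’t stall your calls indefinitely, and wrap the call in a simple circuit breaker so you can serve a degraded response instead of failing outright. The bucket name and the CircuitBreaker class are illustrative assumptions, not a prescribed implementation - adapt the thresholds and fallback behaviour to your own workloads.

```python
import time

import boto3
from botocore.config import Config

# Pin the client to one region and cap retries/timeouts, so a regional
# control-plane or DNS wobble fails fast instead of hanging requests.
regional_cfg = Config(
    region_name="eu-west-2",
    connect_timeout=3,
    read_timeout=5,
    retries={"max_attempts": 3, "mode": "standard"},
)
s3 = boto3.client("s3", config=regional_cfg)


class CircuitBreaker:
    """Tiny illustrative circuit breaker: after `threshold` consecutive
    failures, short-circuit calls for `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While the circuit is open, refuse quickly so callers can degrade.
        if self.opened_at and time.monotonic() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open - serve a degraded response")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker()


def list_reports():
    # "example-reports-bucket" is a hypothetical name. On failure or an open
    # circuit, fall back to an empty (or locally cached) result - graceful
    # degradation rather than a hard outage of your own service.
    try:
        return breaker.call(s3.list_objects_v2, Bucket="example-reports-bucket")
    except Exception:
        return {"Contents": [], "Degraded": True}
```

The same shape applies to any upstream dependency, not just S3: bounded retries, an explicit trip-out, and a pre-agreed degraded mode are what turn a provider incident into a slow afternoon rather than a full outage.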

Choosing the Right Balance

Smaller or specialist providers often serve fewer customers, meaning shorter escalation paths and genuine human engagement during an incident. The trade-off is fewer managed services and smaller geographic footprints, but for some workloads, that’s a price worth paying.

The point isn’t “big cloud bad, small cloud good”. It’s about engineering control back into your business, choosing partners whose scale, expertise, and communication model fit your operational risk tolerance.

Baby Blue’s Take

At Baby Blue IT & Consulting, we see outages like this as proof that resilience is a strategy, not a setting. Whether you’re running hybrid IBM Power environments, mainframes, or x86 compute in the public cloud, your cloud footprint should evolve with your business, not drift behind it.

We help organisations review and audit their cloud and infrastructure providers, test their support models, and re-engineer risk so that when the next outage happens, and it will, you’re ready to explain, recover, and keep operating.

If you’d rather be a big fish in a smaller, well-understood pond than a silent one in an endless ocean, it might be time to reassess your cloud choices: Contact Us.

About the Author

Chris Smith

Chris Smith is a sales leader and consultant with over 30 years of experience in IT managed services. With a background in IBM hardware maintenance, he transitioned from field engineer to sales and marketing director, creating the foundations for Blue Chip Cloud, which became the largest IBM Power Cloud globally at the time. Chris played a key role in the 2021 sale of Blue Chip and grew managed services revenue by 50%. He’s passionate about building customer relationships and has implemented Gap Selling by Keenan to drive sales performance. Now, Chris helps managed service providers and third-party maintenance businesses with growth planning and operational improvement.

LinkedIn

