And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of issues where us-east-1 problems have broader impacts, which makes things far less redundant and resilient than AWS implies.
All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.
And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while it's not possible to obtain new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
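A minimal boto3 sketch of what that distinction looks like in practice; the account ID and role name are made up, and the session length you can ask for depends on the role's configured maximum:

    import boto3

    # Minting *new* credentials goes through the centralized IAM/STS control plane,
    # so this is the part that can fail during a us-east-1 incident.
    sts = boto3.client("sts", region_name="us-west-2")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/example-role",  # hypothetical role
        RoleSessionName="outage-drill",
        DurationSeconds=3600,  # up to the role's configured max session duration
    )["Credentials"]

    # A client built from already-issued credentials keeps talking to regional
    # data planes until those credentials expire, even if new grants are failing.
    s3 = boto3.client(
        "s3",
        region_name="us-west-2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_buckets()["Buckets"])

Which is why "don't close your console tab" is real advice: the session you already hold is the only one you can count on keeping.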
[Nitpick] There are a few more AWS partitions, like GovCloud.
But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you by things like Terraform) suddenly lose their luster.
Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.
Sometimes the circular dependencies get almost cartoonishly silly.
Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs into the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live on an S3 bucket."
I made that example up, but only barely.
That was a weird job, but fun. It was a local machine room for a warehouse that originally held the IBM mainframe; it still held its successor, the Multiprise 3000, whose claim to fame is being the smallest mainframe IBM ever sold. But the room was also full of decades of artisanally crafted Unix servers running Pick databases. The Pick dev team had done most of the system architecture. The best way to understand it is that for them Pick is the operating system, and Unix is a necessary annoyance they put up with only because nobody has made Pick hardware in 20 years. And it was NFS mounts everywhere: somebody had figured out a trick where they could NFS-mount a remote machine and have the local Pick system reach in and scrounge through the remote system's data. But strictly read-only. Pick got grumpy when writing to NFS, to say nothing of how the other database would feel about having its data messed with. Thus the circular mount.
Still, that was not the worst thing I saw. I liked the one system with an SMB mount. "Why is this one SMB?" "Well, Pick complains when you try to write to an NFS mount, but its NFS detection code doesn't trip on SMB mounts." ... Sigh. "Um... I'm no Pick expert, but you know why it doesn't like remote mounts, right? SMB doesn't change that. Do you happen to get a lot of corrupt indexes on this machine?" "Yes, how did you know?"
At some point the behaviour changed and locks started conflicting. IIRC we hit it when upgrading to Debian Etch and took the time to unwind the system and make pure NFS work properly for us. Plenty of people took the opposite approach and fiddled with the config to make locking a noop on SMB. I know of at least one web hosting company that ended up having to restore a year's worth of customer uploads from backups as a result...
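If you ever need to sanity-check a mount like that, a quick way is to try to take a byte-range lock on it and see whether the filesystem honours it. A small sketch (the mount path is hypothetical, and this only shows whether the lock call succeeds on one client, not that locking is coherent across clients):

    import fcntl

    # Try to take an exclusive, non-blocking POSIX lock on a file on the mount.
    # On mounts where locking is broken or disabled this raises OSError (or,
    # worse, silently succeeds without actually coordinating anything).
    with open("/mnt/remote/lock-test", "w") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            print("lock acquired")
        except OSError as exc:
            print("locking failed:", exc)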
> Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.
There was one (later denied) report that a 'guy with an angle grinder' was involved in gaining access to the server cage.
I’ve always thought mission-critical stuff needs two independent key holders, with keyholes placed far enough apart to make it impossible for one person to reach both.
You actually have to present your photo ID at the site entry gatehouse, then again to the building entry guard (who will also check you have a work permit and a site-specific safety induction), then you swipe a badge at a turnstile to get from reception into the stairwell, then swipe your badge at a door to get onto the relevant floor, then swipe your badge and key in a code to enter the room with the cages, and then you use the key.
[1] https://www.nationalmuseum.af.mil/Visit/Museum-Exhibits/Fact...
I guess it shows very few care enough to pay enough to make that a reasonable upgrade.
I'm glad I never had to get that deep into the failure chain.
When you dogfood your own Rube Goldberg machine.
I’m 99% ;) certain dependencies of foundational services are a well-discussed topic
And honestly, everybody else's stuff is in use-1, so at least your failures are correlated with your customers lol.
Yeah, but why put your eggs in that basket? I moved all our services from east to west/oregon a decade ago and haven't looked back.
1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.
2. We're physically close to us-east-1 and have Direct Connect; we're 1 millisecond away from us-east-1. It would be silly to connect to us-east-1 and then take a latency hit and pay cross-region data transfer cost on all traffic to hop over to another region (see the back-of-the-envelope sketch after this list). That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.
3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.
4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.
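For item 2, a back-of-the-envelope sketch; the traffic volume, per-GB rate, and latency figure are purely illustrative, not quoted AWS prices:

    # All numbers here are invented for illustration.
    monthly_gb        = 50_000   # hypothetical traffic that would hop regions
    inter_region_rate = 0.02     # $/GB, ballpark for inter-region transfer
    extra_latency_ms  = 60       # rough cross-country round trip vs ~1 ms on Direct Connect

    print(f"~${monthly_gb * inter_region_rate:,.0f}/month extra, "
          f"plus ~{extra_latency_ms} ms added to every request")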
In fantasy magic dream land loads are distributed evenly across different cloud providers.
A single point of failure doesn't exist.
It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.
Healthcare in the US is affordable.
All types of magical stuff exist here.
But no. It's another day. AWS us-east-1 can take down most of the internet.
But even then, the load balancer needs to run somewhere, which becomes a new single point of failure.
I’m sure someone smarter than me has figured this out.
It's basically a wash for almost all organizations for twice the cost and effort.
But where does the load balancer actually run? Does the main load balancer run on AWS, and the backup on Oracle?
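One common answer is to push the primary/backup decision down into DNS and let health checks flip traffic between providers. A sketch of that idea using Route 53 failover records via boto3 (the zone ID, health check ID, and addresses are placeholders); of course the DNS provider then becomes the thing you're trusting, which is exactly the point above:

    import boto3

    route53 = boto3.client("route53")

    def upsert(identifier, role, ip, health_check_id=None):
        # "PRIMARY" answers are served while their health check passes;
        # "SECONDARY" answers take over when it fails.
        record = {
            "Name": "app.example.com.",
            "Type": "A",
            "SetIdentifier": identifier,
            "Failover": role,
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId="Z0000000000EXAMPLE",  # placeholder zone
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
        )

    upsert("aws-primary", "PRIMARY", "203.0.113.10", health_check_id="00000000-placeholder")
    upsert("other-cloud-secondary", "SECONDARY", "198.51.100.20")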
Looking at Azure and GitHub in particular. ;)
You were dating twins as a form of redundancy?!
The last Azure outage I heard of wasn't even on the HN frontpage
I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest an underlying physical hardware problem.
It’s also the most complex region for AWS themselves, as it’s the “control plane” for many of their global services.
If your customers are clustered in Toronto and Montreal, it probably makes a lot of sense to use ca-central-1. If you've got a lot of customers in Western Canada, us-west-2 is gonna have better network latency.
Other than a couple regions that had problems with their local network infrastructure (sa-east-1 was like that), there's little or nothing to differentiate the regions in terms of physical infrastructure and architecture.
These bets aren’t as innocent as they seem because the bettors can often influence or change the outcome.
Imagine if the betting website itself shuts down because AWS is down. (half joking I suppose though)
> These bets aren’t as innocent as they seem because the bettors can often influence or change the outcome.
Overall I agree with your statement: these betting markets can also incentivize a lot of insider trading and, one could say, negative scenarios, since they give bettors an incentive to capitalize on them.
So did some cooling equipment fail here or was there an external reason for the overheating? Or does Amazon overbook the cooling in their data centers?
Cooling in datacenters is, like everything else, both over- and under-provisioned.
It's overprovisioned in the sense that the big heat exchange units are N+1 (or, in very critical and smaller-load facilities, 2N/3N). This is done because you need to regularly take these down for maintenance, and they have a relatively high failure rate compared to traditional DC components and require mechanical repairs that need specialized labor and long lead times. In a bigger facility it's not uncommon for cooling to be N+3 or more as N grows, because you're effectively always servicing something, or have a unit down waiting for a blower assembly that needs to be literally made by a machinist with a lathe because the part doesn't exist anymore, but that's still cheaper than replacing the whole unit.
The systems are also under-provisioned in the sense that if every bit of compute capacity in the facility suddenly went from average power draw to 100% power draw, you would overload the cooling capacity; you would commonly also overload things in the electrical and other paths. Provisioning this way is just the nature of the industry.
In general neither of these things poses a real problem, because compute loads don't spike to 100% of capacity, and when they do spike they don't spike for terribly long, and nobody builds facilities on a knife-edge of cooling or power capacity.
The problem comes when you have the intersection of multiple events.
You designed your cooling system to handle 200% of average load which is great because you have lots of headroom for maintenance/outages.
Repair guy comes on Tuesday to do work on a unit and finds a bad bearing, has to get it from the next state over so he leaves the unit off overnight to not risk damaging the whole fan assembly (which would take weeks to fabricate).
The two adjacent cooling units are now working JUST A BIT harder to compensate, and one of them also had a motor that was slightly imbalanced, or a fuse that was loose and warming up a bit, and now, with an increased duty cycle, that thing which worked fine for years goes pop.
Now you're minus two units in an N+2 facility. Not really terrible, remember you designed for 200% of average load.
That 3rd unit on the other side of the first failed unit, now under way more load, also has a fault. You're now minus 3 in an N+2 facility.
Still, not catastrophic because really you designed for 200% of average load.
The thing is, it's now 4AM, the onsite ops guy can't fix these faults and needs to call the vendor who doesn't wake up till 7AM and won't be onsite till 9.
Your load starts ramping up.
Everything up above happens daily in some datacenter in the USA. It happens in every datacenter probably once a year.
What happens next is the confluence of events which puts you in the news.
One of your bigger customers decides now is a great time to start a huge batch processing job. Some fintech wants to run a huge model before market open or some oil firm wants to do some quick analysis of a new field.
They spin up 10000 new VMs.
Normally, this is fine, you have the spare capacity.
But, remember, you planned cooling for 200% of AVERAGE load, and these are not nodes that are busy but not terribly busy; these are nodes doing intense, optimized number-crunching work, which means they draw max power and thus expel max waste heat.
Not only has your load spiked in terms of aggregate number of machines, but their waste heat per machine is also higher than average.
Boom, cascading failure, your cooling is now N-4.
Server fans start ramping up faster which consumes more power.
Your cooling is now N-5.
Alarms are blaring all over the place.
Safeties on the cooling units start to trip as they exceed their load and refrigerant pressures rise.
Your cooling is now N-6.
Your cooling is now N-7.
Your cooling is now 0.
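To put toy numbers on the scenario above (every figure here is invented purely to show the shape of the failure, not taken from any real facility):

    # Toy model: cooling designed for 200% of average load, a 10-unit fleet,
    # then a batch job pushes part of the floor from average to max draw.
    racks           = 500
    avg_kw_per_rack = 8        # made-up "typical" draw
    max_kw_per_rack = 17       # made-up "everything pegged" draw

    design_kw   = racks * avg_kw_per_rack * 2     # the "200% of average" design point
    units       = 10                              # cooling units sized to cover design_kw
    kw_per_unit = design_kw / units

    hot_racks = 150                               # racks grabbed by the big batch job
    load_kw   = (racks - hot_racks) * avg_kw_per_rack + hot_racks * max_kw_per_rack

    for failed in range(0, 8):
        capacity_kw = (units - failed) * kw_per_unit
        status = "OK" if capacity_kw >= load_kw else "OVERLOAD"
        print(f"{failed} units down: {capacity_kw:>6.0f} kW capacity vs {load_kw:.0f} kW load -> {status}")

With these made-up numbers the floor survives three dead units and tips over at the fourth, which is roughly the cliff the story above walks off.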
Reminds me of when I did noogler training back in the day, and one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner and briefly conducted.
It's cold up here in the winter; sadly, the residual heat from even totally passive components like switchgear is enough to warm things up enough to attract them. 0.001% of 1 MW of power is still quite warm. (I have no idea how much switchgear leaks, but I know it's warm even outdoors in winter.)
And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.
But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is because they reduced the amount of heat being produced.
Right, exactly, I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity during events, and I'm sure they even sometimes fake having no capacity for similar reasons. "No capacity" means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".
This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.
spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
A decade ago it was trivial to just tell the hypervisor to reduce the cpu fraction of all VMs by half and leave half unallocated. Now, it's much more complicated and definitely would be user visible.
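For what it's worth, the mechanical half of that old trick is still fairly simple on a modern cgroup-v2 KVM host; it's everything around it (scheduling, SLAs, customer visibility) that got complicated. A rough sketch, assuming libvirt-managed guests living under machine.slice and root access:

    import glob

    # Halve the CPU quota of every libvirt-managed guest cgroup (cgroup v2).
    # cpu.max holds "<quota_us> <period_us>", or "max <period_us>" when uncapped.
    for path in glob.glob("/sys/fs/cgroup/machine.slice/*/cpu.max"):
        with open(path) as f:
            quota, period = f.read().split()
        if quota == "max":
            # Uncapped guest: picking a sensible cap needs its vCPU count, so skip here.
            continue
        with open(path, "w") as f:
            f.write(f"{int(quota) // 2} {period}")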
Some fail below 100% too.
But this is the physical world, shit happens.
The algorithm didn't know that fuse was loose: fine at 50% duty cycle, but high resistance and going to blow at 100%.
It wasn't until a real incident that we learned: (a) the system wasn't resilient to the utility power going on-off-on-off-on-off as each 'off' drained the batteries while the generator started, and each 'on' made the generator shut down again; (b) the ops PCs were on UPSes but their monitors weren't (C13 vs C5 power connector) and (c) the generator couldn't be refuelled while running.
Even if you've got backup systems and you test them - you can never be 100% sure.
Turtles all the way down.
At AWS scale even unlikely hardware events become more common I guess.
They didn't say how, but apparently the pipes between each floor and the roof were not redundant. It took almost 24 hours to fix.
I find Hetzner's UI to be super-confusing, making it hard to manage things.
OVH is way simpler, and OpenStack integration from them works well enough for most of my needs.
Kinda insane how atrocious docs are tho. No .md markdown format to let agents read stuff yet -_-
AWS EC2 outage in use1-az4 (us-east-1)
> AWS in 2025: The Stuff You Think You Know That’s Now Wrong
> us-east-1 is no longer a merrily burning dumpster fire of sadness and regret.
— https://www.lastweekinaws.com/blog/aws-in-2025-the-stuff-you...
Otherwise a good article!
I have systems running in us-east-1, and over the course of the incident, I noticed unexplainable intermittent connectivity issues that I've never seen before, even outside of az4.
Come and give me your cash if you want resilience.
A two-loop cycle with a heat exchanger to get rid of the heat.
But NoVA is basically the same sort of economic cluster that Paul Krugman won his Nobel Prize in Economics for studying, just for datacenters, not factories.
There's a great read about the whole area here: https://www.amazon.com/Internet-Alley-Technology-1945-2005-I...
As for AWS, I often see it repeated that the DCs are the oldest and therefore in disrepair. That's not true; many of the first ones have since been replaced. But there are services that are located here and only here.
But I'll also add, a lot of customers default to using US-East-1 without considering others, and too many deploy in only one AZ. Part of this is AWS's fault as their new services often launch in US-East-1 and West-2 first, so customers go to East-1 to get the new features first.
Speaking as one who was with AWS for 10 years as a TAM and Well-Architected contributor, I saw a lot of customers who didn't design with too much resiliency in mind, and so they get adversely affected when east-1 has an issue (either regional or AZ). The other regions have their fair bit of issues as well. It's not so much that east-1 necessarily fails more than the others, it's that it has so many AZs and so many workloads that people notice it more.
Why is that? You would think company-ending events like IAM going poof due to it being dependent on us-east-1 would be a top priority to fix?
If you're building a single datacenter site this is where you start building first.
Coastal land is much more expensive. If you go to a remote coastal site, you probably won't have as good access to power.
Coastal sites are usually exposed to more severe weather events.
Other fun unpredictable things: e.g. the Diablo Canyon nuclear facility has had issues with debris and jellyfish migration blocking its saltwater cooling intake.
https://www.nbcnews.com/news/world/diablo-canyon-nuclear-pla...
In one of the slides, there were factors that influence the decision of where to build a data center, and several of the items involved finding a place with enough space and skilled people to work at the data center. He also commented that sometimes there is politics involved in choosing the location of the next data center.
Toronto is the textbook example of this working. It's on a freshwater lake that is deep relatively close to the shore, and the downtown has expensive real estate blocking traditional methods.
https://en.wikipedia.org/wiki/Deep_Lake_Water_Cooling_System
Cold water -> data centre cooling loop -> warm water -> paper mill with heat pumps to transform low-grade heat into the required temperatures -> profit
https://datacenters.google/locations/hamina-finland/
> Using a cooling system with seawater from the Bay of Finland and a new offsite heat recovery facility, our Hamina data centre is at the forefront of progressing our sustainability and energy-efficiency efforts.
https://netflixtechblog.com/the-netflix-simian-army-16e57fba...
When customers pay for cloud services, they expect them to be maintained by competent engineers.
edit: Not sure why the downvotes. If you fire the engineers that have been keeping your systems running reliably for years, what do you expect to happen?
This was one data centre in one zone of a multi-zone region.
Yes, IAM/R53 and others are centralized there, and yes, reworking those services to be decentralized and cross-region would be a Good Thing. But us-east-1 is already multi-zone (6, with a seventh marked as "coming in 2026") with multiple DCs within zones. From memory, when a global service like IAM is out, it's more likely to be a bug in the implementation or a dependency than a "if this was cross-region it wouldn't have died" issue.
But this wasn't an outage of any AWS global service this time. The only one that seemed to have more impact was/is MSK, which is likely to be more of an issue with Kafka than anything AWS-related.