Internet Outage: What Really Happens?

Barcelona. As it draws to a close, 2025 has insistently reminded us that we live in a society dependent on the internet. Digital networks and services have experienced some of the most massive outages in recent history, with hundreds of millions of users affected around the world. Most worryingly, many of these outages were caused by human error, badly deployed updates or faulty configurations. Where is that distributed internet that was supposed to be resilient?


The most serious incident was on October 20, when Amazon Web Services (AWS) generated more than 17 million incident reports on DownDetector. But that number only counts users annoyed enough to go to a third-party website and press the red button. The industry estimates that for every person who actively complains, between 20 and 100 simply stay quiet and reset their router. By a conservative extrapolation, we are talking about 850 million people affected: one in five internet users on the planet was left hanging because of an error at Amazon.

The outage, lasting more than 15 hours, originated in the automated DNS management system linked to the DynamoDB database in the US-EAST-1 region. This is the internet’s messy storage room in Virginia, where layers of outdated technology pile up. A single point of failure that every system architect knows to avoid, but everyone uses because it’s where Amazon deploys new things first.

The problem was of a technical banality that is hard to believe. An automated update caused two processes to try to write to the DNS system at the same time. Instead of running one first and then the other, the system got confused and ended up deleting the routes. The result: the database worked perfectly, but nobody could find their way to it. It was as if someone had deleted the AP-7 from the GPS map; the highway is still there, but the cars can't find it.
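A much-simplified sketch of that failure mode, with entirely hypothetical names (this is not AWS's real code): two automated "enactors" apply DNS route plans concurrently, and without some ordering rule, a slow enactor can overwrite a newer plan or delete the live record. A generation counter on each plan is one classic guard:

```python
# Hypothetical model of the race described above. Each DNS plan carries a
# monotonically increasing generation number; a writer refuses to apply a
# plan older than the one already live, so a lagging process cannot clobber
# or delete the current route.

records = {}  # DNS zone: name -> (plan_generation, ip)

def apply_plan(name, generation, ip):
    """Apply a route plan only if it is newer than what is already live."""
    current = records.get(name)
    if current and current[0] >= generation:
        return False  # stale plan: rejected instead of overwriting the record
    records[name] = (generation, ip)
    return True

# The newer plan lands; the stale one, arriving late, is rejected.
apply_plan("dynamodb.us-east-1", 2, "10.0.0.2")
apply_plan("dynamodb.us-east-1", 1, "10.0.0.1")
```

Without that check, whichever writer happens to run last wins, which is exactly how a "perfectly working" database can end up with no route pointing at it.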

The supreme irony is that AWS's own dashboards also depended on this DNS: when Amazon engineers tried to get into the system to fix the problem, they couldn't access it. Among the affected users, owners of Eight Sleep connected beds couldn't control them; some people, it turns out, need AWS to sleep.

The second most notorious incident was that of the PlayStation Network on February 7, with almost 4 million complaints. Its 116 million monthly users went 24 hours without being able to play, the second-longest outage in PSN history since 2011. Most frustrating of all, it coincided with the launch of the Monster Hunter Wilds beta. The PSN outage is a brutal reminder that we don't own our games. When you buy a game for 70 euros in the digital store, you buy the right to play it for as long as Sony wants, and is able, to keep the server on. Even single-player games would fail if they had to connect to validate the license.

Cloudflare, a company dedicated precisely to protecting the internet from outages, suffered significant outages of its own. The one on November 18 knocked out Spotify, ChatGPT and Discord for almost five hours. The reality was prosaic: an engineer applied an update to the database that manages bot detection. The change caused an internal query to return duplicate data, which made a configuration file grow beyond the limit the software could read. When Cloudflare's thousands of servers received this oversized file, the software panicked and the servers went into an endless loop of restarts. CEO Matthew Prince admitted it was "the worst incident since 2019".
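The chain of events can be sketched in a few lines. This is an illustrative toy, not Cloudflare's actual pipeline: the function names and the limit of 200 entries are invented, but the mechanism matches the description above, where a buggy query emits duplicate rows, the generated file blows past a hard-coded limit, and the loader turns that into a fatal error:

```python
# Toy model: duplicate query rows inflate a generated config file past a
# hard limit in the consumer, turning a data bug into a crash.

MAX_FEATURES = 200  # stand-in for the loader's hard-coded size limit

def build_feature_file(rows):
    """Render query rows into config lines (one feature per row)."""
    return [f"feature:{r}" for r in rows]

def load_features(lines):
    """Refuse files over the limit, as the real loader effectively did."""
    if len(lines) > MAX_FEATURES:
        raise RuntimeError("feature file exceeds limit")
    return set(lines)

rows = list(range(150))
load_features(build_feature_file(rows))  # 150 lines: loads fine

duplicated = rows + rows  # the buggy query returns every row twice
try:
    load_features(build_feature_file(duplicated))  # 300 lines: loader fails
except RuntimeError:
    pass  # in production this failure looped into endless restarts

# Deduplicating before writing keeps the file within bounds.
deduped = load_features(build_feature_file(sorted(set(duplicated))))
```

The design lesson is that the limit check itself was reasonable; what was missing was a graceful fallback (keep serving the last good file) instead of a hard failure.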

The year has also been tough here. On April 28, Spain and Portugal experienced the biggest blackout in recent history. Internet traffic dropped by 80% to 90% for more than 36 hours. Mobile networks shut down as backup batteries ran out. The Spanish economy suffered losses estimated at 1.6 billion euros.

Three weeks later, on May 20, Spain was again half down due to a Telefónica network upgrade that went wrong. Madrid, Barcelona, Valencia, Seville and Bilbao reported massive outages. The 112 emergency number stopped working in many autonomous communities. "All services have been restored, except for a couple," Telefónica's chief operating officer said afterwards, with striking nonchalance.

Why outages affect so many people

The answer is as simple as it is troubling: the internet is much more centralized than we want to believe. Companies like AWS, Cloudflare, Microsoft Azure or Google Cloud dominate the market. When one of them goes down, it drags down thousands of apps that depend on it. AWS has approximately 32% of the global cloud computing market. When their service goes down, platforms like Netflix, Spotify or Roblox are inaccessible. The October incident affected Delta, preventing passengers from checking in. Cloudflare, on the other hand, provides services to millions of websites. When their systems fail, unrelated websites disappear simultaneously.

Several distributed monitoring systems are combined to detect these interruptions. Platforms like ThousandEyes and Catchpoint use thousands of global monitoring points that analyze billions of measurements every day using protocols like BGP (Border Gateway Protocol) and DNS. When there are anomalous changes in BGP routes, systems can detect outages within minutes.
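The detection logic behind such systems can be reduced to a simple idea: compare the current rate of routing events against a quiet baseline and flag large deviations. The sketch below is illustrative only (it is not ThousandEyes' or Catchpoint's actual method, and the numbers are invented), using BGP route withdrawals per minute as the signal:

```python
# Illustrative anomaly check: flag a monitoring window when the count of
# BGP route withdrawals sits far above the recent baseline.

from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` exceeds the baseline mean by more than
    `threshold` standard deviations (with a floor to avoid a zero sigma)."""
    mu, sigma = mean(history), stdev(history)
    return current > mu + threshold * max(sigma, 1.0)

# A quiet baseline of withdrawals per minute, then a mass-withdrawal event.
baseline = [12, 9, 14, 11, 10, 13, 8, 12]
is_anomalous(baseline, 16)   # normal jitter: not flagged
is_anomalous(baseline, 400)  # routes vanishing en masse: flagged
```

Real platforms correlate many such signals (BGP, DNS resolution failures, packet loss) across thousands of vantage points, which is what lets them localize an outage within minutes rather than merely notice that something is wrong.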

DownDetector, owned by Ookla, takes a different approach: it aggregates notifications from affected users. It is less technically precise but very effective in measuring real impact.

When a massive outage is detected, a race against time begins. Engineers must first identify the cause of the problem inside immensely complex systems. Modern operators rely on automated rollback systems to revert the most recent changes. In Cloudflare's case it still took more than five hours, because the corrected configuration had to be propagated to all of its global data centers.

In case of electrical breakdowns, recovery is slower and more physical. Operators must reset node by node, antenna by antenna. Backup batteries give approximately eight hours of autonomy, but in extended blackouts they fall short.

The lessons of 2025

This year has taught us that perhaps we should reconsider our blind trust in the cloud. US Senator Elizabeth Warren summed it up after the AWS incident: “If one company can break the entire internet, it’s too big. Period.” We also learned that human errors are inevitable, but recovery systems are too slow. When a misconfiguration can put millions of users out of service for hours, we need to rethink how we deploy updates to critical infrastructure.

We have discovered that the promise of a distributed internet architecture is more of a marketing slogan than a reality. Three or four companies control the essential infrastructure of the global network. Automation, which had been sold to us as the solution to human error, has become an error amplifier that propagates failures at the speed of light through thousands of servers before any human can shout "Stop the machines!"

2025 is not over yet, but it has already been eloquent enough. The internet is very useful when it works, and catastrophically useless when it doesn't. And it depends more and more on fewer hands. How reassuring.

Aiko Tanaka

