Amazon Web Services experiences another big outage

It’s the latest of several recent AWS outages that took down large chunks of the digital economy. Two weeks ago, service problems tied to malfunctioning network devices knocked Roomba vacuums and Amazon’s Ring doorbells offline. Another outage occurred last week.

Cloud systems such as AWS allow companies to rent servers and computing power over the Web, and they’ve revolutionized the Internet with promises of a reliable online backbone, available at any minute.

But the outages have underscored how this consolidation of the Internet’s once-distributed capabilities also means that a single failure can have wide-ranging ripple effects, weakening the hidden backbone undergirding much of the Web.

“A single glitch in a high-profile provider will have huge implications on countless organizations of all sizes, in often very unexpected ways,” said Ed Skoudis, president of the SANS Technology Institute. “Service interruptions are vast and impact thousands of companies and millions of users. We are putting more eggs into fewer and fewer baskets. More eggs get broken that way.”

Amazon did not immediately respond to requests for comment. Amazon founder Jeff Bezos owns The Washington Post.

Reliably keeping a giant “cloud” of international data centers online is tough, said Steven Bellovin, a computer science professor at Columbia University. Every change must be tested before it’s deployed and closely monitored afterward, with an automatic way to back out in case of problems and a safety net of redundant software and backup servers, just in case.

Amazon has not released technical details on the underlying faults, and occasional outages are expected. But so many errors in a short time suggest that some of the backup systems might be inadequate to the task, Bellovin said.

“The short answer is that I’m disturbed,” he added. “I’ve long been a fan of cloud services … and it’s possible that this is just malign coincidence for Amazon … but if they can’t accommodate growth, they’re in a bad place.”

AWS is the world’s largest provider of cloud-computing services, with 40 percent of the worldwide market for infrastructure cloud services last year, according to the market research firm Gartner. Microsoft was a distant second, with roughly 20 percent.

But moving among the biggest cloud-computing services — Amazon’s AWS, Microsoft’s Azure and Google Cloud — is a challenge, because each system works differently and relies on its own infrastructure.

More companies, Skoudis said, are starting to talk about using multiple cloud systems simultaneously, even though the approach is pricey and “a little ridiculous, given how the cloud was advertised as giving us reliability and affordability.”

The causes of the three outages this month reveal how the cloud’s increasing intricacy and demands have led to more potential for disaster. The five-hour outage Dec. 7, AWS engineers wrote in a postmortem, was caused by a glitch in some automated software that led to “unexpected behavior” that then “overwhelmed” AWS networking devices and hit computer systems on the East Coast.

The second outage, which lasted for less than an hour Dec. 15, affected mostly West Coast devices and was blamed on “network congestion” due to some internal engineering that “incorrectly moved more traffic than expected to parts of the AWS backbone that affected connectivity,” according to a company statement.

During Wednesday’s outage, which Amazon said was due to data center power issues, users on Downdetector, a site for measuring Internet outages, said they had trouble accessing sites including the video-streaming service Hulu and the investment site Fidelity.

Last year, huge swaths of the Web were knocked offline after Amazon’s Northern Virginia servers became overwhelmed. And Skoudis suspects more issues will arise as the Internet grows more complex.

“In the IT field, we sometimes joke about how we spend 15 years centralizing computing, followed by 15 years decentralizing, followed by another 15 years centralizing again,” he said. “Well, we have spent the past 10 years centralizing again, this time on [the] cloud.”