SEATTLE — The addition of new servers to Amazon’s dominant cloud-computing network triggered a cascading set of errors that took down large swaths of the web Wednesday, the company acknowledged.
Amazon said in a lengthy and technical blog post Saturday morning that a massive computing network in Northern Virginia began to fail after "a relatively small addition of capacity" it started to make to the system just before 6 a.m. Eastern on Wednesday. But because of "an operating system configuration," the new capacity set off a series of errors that overwhelmed Amazon's network of servers.
Within a few hours, the malfunctions began hitting customers of Amazon Web Services, the company's cloud-computing unit. Customers of the Amazon-owned Ring security camera service couldn't log in or watch video. Users struggled to operate their iRobot vacuum cleaners because the outage affected the iRobot Home App. And media companies, including The Washington Post (privately owned by Amazon Chief Executive Jeff Bezos), experienced publishing system outages.
Amazon acknowledged that the system failure was exacerbated by the interdependencies among its various services. The company had been trying to add capacity to its Amazon Kinesis service, which customers use to process real-time data, including video, audio and application logs. To resolve the issue, Amazon needed to restart a piece of its system it described as "many thousands of servers," a lengthy process that had to be done gradually. But because other Amazon cloud services rely on Kinesis, including its Cognito authentication offering, they failed as well.
And because Amazon uses Cognito itself to let customers know about the status of its cloud operations through its Service Health Dashboard website, it couldn't immediately update that site. The company has a backup method to update the site, but said "it is a more manual and less familiar tool for our support operators."
An Amazon spokeswoman didn't respond Saturday to a request for comment about the outage. In the blog post, the company pledged to do "everything we can to learn from this event."