AWS’s S3, a web-based object storage service, experienced widespread issues last Tuesday morning. The S3 outage was due to “high error rates with S3 in us-east-1,” per Amazon’s AWS Service Health Dashboard.
Some of the many affected websites and services (that we know about) include Quora, newsletter provider Sailthru, Business Insider, Giphy, image hosting at a number of publisher websites, file sharing in Slack, and many more. Connected light bulbs, thermostats, and other IoT hardware were also affected, with many users unable to control those devices during the outage. Amazon S3 is used by around 148,213 websites and 121,761 unique domains, according to data tracked by SimilarTech (though these numbers seem low).
The internet was not happy with the outage; snarky comments were flying on the Twittersphere, the primary complaint being the lack of information from AWS while the incident was unfolding. Most learned about the issue on Twitter before they saw a notification on the AWS Service Health Dashboard. As it turned out, some of the dashboard’s functionality also depends on S3, which is why notifications came late and failed to fully describe the nature of the impact, Amazon later explained. A few people decided to put a positive spin on the outage, terming it a “digital snow day”.
A few days later, Amazon announced the cause of the outage: human error. The company said that one of its employees was debugging an issue with the billing system and accidentally took more servers offline than intended. That error set off a domino effect that took down two other server subsystems in turn.
“Removing a significant portion of the capacity caused each of these systems to require a full restart,” the post read. “While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the us-east-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.” Read the full AWS post here.
It should be no surprise that the company is making some changes to ensure that a similar human error would not have as large an impact. One change is that the tool employees use to remove server capacity will no longer allow them to remove as much capacity, or remove it as quickly, as they previously could. Amazon is also making changes to prevent the AWS Service Health Dashboard from malfunctioning in the event of a similar occurrence.
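As a purely illustrative sketch (not Amazon’s actual internal tooling), a capacity-removal tool can enforce both a per-operation cap and a minimum remaining fleet size, so a mistyped command is rejected before it can cascade. The thresholds and names below are assumed values:

MAX_REMOVAL_FRACTION = 0.05   # assumed cap: never remove more than 5% of the fleet in one operation
MIN_REMAINING_SERVERS = 100   # assumed floor: never let the fleet drop below this size

def validate_capacity_removal(total_servers: int, servers_to_remove: int) -> None:
    """Reject removal requests that exceed the per-operation cap or breach the safety floor."""
    if servers_to_remove > total_servers * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {servers_to_remove} of {total_servers} servers: "
            f"exceeds the {MAX_REMOVAL_FRACTION:.0%} per-operation cap"
        )
    if total_servers - servers_to_remove < MIN_REMAINING_SERVERS:
        raise ValueError(
            f"Refusing removal: only {total_servers - servers_to_remove} servers would remain, "
            f"below the floor of {MIN_REMAINING_SERVERS}"
        )

# A typo that asks for 500 servers instead of 5 is caught before any capacity is touched.
try:
    validate_capacity_removal(total_servers=1000, servers_to_remove=500)
except ValueError as err:
    print(err)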
It just goes to show that no matter how big, robust, and automated a service becomes, it still takes only one human with admin privileges to bring it all crashing down. Even a system like S3, which is distributed across multiple Availability Zones, can experience a region-wide outage. Architecting across regions, or even across cloud providers, is crucial to ensuring 24x7x365 uptime.
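As one concrete starting point, S3 itself supports cross-region replication, so objects written to a primary bucket are automatically copied to a bucket in another region. The sketch below uses boto3; the bucket names, regions, role ARN, and account ID are placeholders, and it assumes both buckets already exist along with an IAM role that allows S3 to replicate objects on your behalf.

# Minimal sketch of enabling S3 cross-region replication with boto3.
# Bucket names, regions, the IAM role ARN, and the account ID are placeholders.
import boto3

primary = boto3.client("s3", region_name="us-east-1")
failover = boto3.client("s3", region_name="us-west-2")

# Replication requires versioning on both the source and destination buckets.
primary.put_bucket_versioning(
    Bucket="my-primary-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
failover.put_bucket_versioning(
    Bucket="my-failover-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Copy every new object written to the primary bucket into the failover bucket.
primary.put_bucket_replication(
    Bucket="my-primary-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-all",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-failover-bucket"},
            }
        ],
    },
)

With a replicated copy in a second region, a DNS-level failover (for example, Route 53 health checks with failover routing) can redirect traffic to the secondary bucket if the primary region goes down.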
For assistance with building a multi-region, highly available AWS environment, contact us here.