Late last night, Amazon issued a statement explaining the cause of the problem that hobbled its S3 storage system yesterday morning. It was not a hardware failure. Rather, the service’s authentication system, which verifies the identity of a user, became overloaded with authentication requests. As one person explained to me, it amounted to a kind of accidental DDoS (distributed denial of service) attack, and Amazon didn’t have enough capacity in place in one of its data centers to handle the surge.
Here’s Amazon’s explanation:
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
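To put some rough numbers on that last point (my own illustration, with made-up figures, not Amazon’s): if an authenticated, cryptographically signed request costs, say, ten times as much to process as an ordinary one, then the mix of requests matters far more than the raw count. Total volume can stay perfectly normal while resource consumption climbs past capacity, as this back-of-envelope sketch shows:

```python
# Back-of-envelope illustration with made-up numbers (not Amazon's):
# a shift in the *mix* of requests can exhaust capacity even while
# total request volume stays flat.

REQUESTS_PER_SEC = 100_000   # total volume, the same in every scenario
COST_PLAIN = 1.0             # relative cost of an ordinary request
COST_AUTH = 10.0             # relative cost of an authenticated (cryptographic) request
CAPACITY = 300_000           # total processing budget per second (arbitrary units)

def load(auth_fraction: float) -> float:
    """Resource consumption for a given share of authenticated calls."""
    auth = REQUESTS_PER_SEC * auth_fraction
    plain = REQUESTS_PER_SEC - auth
    return auth * COST_AUTH + plain * COST_PLAIN

for frac in (0.05, 0.20, 0.40):
    used = load(frac)
    print(f"{frac:.0%} authenticated: {used:,.0f} units ({used / CAPACITY:.0%} of capacity)")

# 5% authenticated: 145,000 units (48% of capacity)
# 20% authenticated: 280,000 units (93% of capacity)
# 40% authenticated: 460,000 units (153% of capacity)
```

That, roughly, is how “overall request volumes … within normal ranges” and “over its maximum capacity” can both be true at once.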
Amazon promises quick action to ensure the problem doesn’t happen again and that users are supplied with better information on system status:
As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
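Item (a) is the interesting one technically: watching the ratio of authenticated requests rather than just the raw volume. Here is a minimal sketch of what that kind of proportion-based monitoring might look like, with a sliding window and an alert threshold; the structure and numbers are my own assumptions, not anything Amazon has described:

```python
from collections import deque
from typing import Optional
import time

class AuthProportionMonitor:
    """Tracks the share of authenticated requests over a sliding window
    and flags when it crosses a threshold. Purely illustrative."""

    def __init__(self, window_seconds: float = 60.0, alert_threshold: float = 0.20):
        self.window_seconds = window_seconds
        self.alert_threshold = alert_threshold
        self.events = deque()  # (timestamp, is_authenticated) pairs

    def record(self, is_authenticated: bool, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.events.append((now, is_authenticated))
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def authenticated_fraction(self) -> float:
        if not self.events:
            return 0.0
        authenticated = sum(1 for _, is_auth in self.events if is_auth)
        return authenticated / len(self.events)

    def should_alert(self) -> bool:
        return self.authenticated_fraction() > self.alert_threshold


# Simulate a burst in which roughly a third of requests are authenticated.
monitor = AuthProportionMonitor(window_seconds=60, alert_threshold=0.20)
for i in range(1_000):
    monitor.record(is_authenticated=(i % 3 == 0))
print(monitor.authenticated_fraction(), monitor.should_alert())   # ~0.33 True
```

In a real system this would sit on the request-handling path and feed an alerting pipeline, but the point is simply that the ratio, not the raw count, is the signal Amazon says it wasn’t watching.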
All in all, I think Amazon has handled this outage well. The problem revealed some flaws in Amazon’s otherwise highly reliable system, including shortcomings in its communications with users, and the company will make important improvements as a result, to the benefit of all its customers. These kinds of small but embarrassing failures – the kind that get you asking WTF? – can be blessings in disguise.
UPDATE: This post originally contained an excerpt from an internal Amazon email, with the subject line “WTF,” which traced the source of the outage to a barrage of authentication requests generated by “a single EC2 customer.” (EC2 is Amazon’s online computing service.) I decided to delete the email because an Amazon spokesman subsequently informed me that it appeared to have been written before a full analysis was done on the root cause of the outage and hence did not accurately portray that cause. I apologize for supplying what appears to have been misleading or at least incomplete information.