Crash: Amazon’s S3 utility goes down

There are reports that Amazon’s Simple Storage Service – called S3 – suffered a “massive” outage this morning, beginning at about 7:30 am eastern time. At 9:03, an Amazon official posted a note in the service’s customer forum, saying, “We can confirm the high error rate you’re experiencing. While we don’t have an ETA at this point, we’re working as quickly as possible to restore performance. We’ll provide updates as soon as we have them.” A poster at Hacker News reports that the service is now “back up,” but another poster says that service remains “spotty.”

The outage, which may have spread to other Amazon web services, appears to be affecting many web businesses, including prominent ones like Twitter, which uses the service to store the images that appear on its site. Writes one blogger: “Amazon S3 goes down … panic ensues.” But another S3 customer, also posting in the service’s forum, is more sanguine: “This is the first outage I have experienced since I joined the service nearly a year ago. Yes it sucks, yes I hope they get it fixed very soon… but, the sky is not in fact falling at the moment.”

As someone who believes in the growth of the utility mode of computing, I feel compelled to point out the inevitable glitches that are going to happen along the way. How the supplier responds – in keeping customers apprised of the situation and explaining precisely what went wrong and how the source of the problem is being addressed – is crucial to building the trust of current and would-be users. When Salesforce.com suffered a big outage two years ago, it was justly criticized for an incomplete explanation; the company subsequently became much more forthright about the status of its services and the reasons behind outages. Given that entire businesses run on S3 and related services, Amazon has a particularly heavy responsibility not only to fix the problem quickly but to explain it fully.

UPDATE: As of 10:17, Amazon reports, “We’ve resolved this issue, and performance is returning to normal levels for all Amazon Web Services that were impacted. We apologize for the inconvenience. Please stay tuned to this thread for more information about this issue.” An S3 user suggests: “A health monitor would be useful – something to show what amazon thinks the status of the services are and to post official information. Maybe even proactive alerts or something I could tie our other infrastructure notifications into so I could be proactive in alerting our downstream affected users.” Another complains: “Amazon’s response was substandard in this case. I should, minimally, see a message on the front page at aws.amazon.com when there’s a complete outage.” I would expect that Amazon will roll out additional tools for monitoring service status and alerting users about problems in fairly short order.
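
The kind of health monitor that S3 user asks for can be sketched in a few lines. What follows is only an illustration, not anything Amazon has published: the class name `HealthMonitor`, the 20-probe window, and the 5% and 50% thresholds are all assumptions chosen for the example. It keeps a rolling window of recent response codes, treats any 5xx code as a service-side error, and reports a coarse status that a customer could tie into their own notifications.

```python
from collections import deque


class HealthMonitor:
    """Hypothetical rolling-window monitor; names and thresholds are assumptions."""

    def __init__(self, window=20, degraded_at=0.05, down_at=0.5):
        # Only the most recent `window` probe results are kept.
        self.results = deque(maxlen=window)
        self.degraded_at = degraded_at
        self.down_at = down_at

    def record(self, http_status):
        # Store True for a service-side error (any 5xx response), else False.
        self.results.append(http_status >= 500)

    def error_rate(self):
        # Fraction of recent probes that failed; 0.0 when nothing is recorded yet.
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def status(self):
        # Map the error rate onto a coarse, human-readable status.
        rate = self.error_rate()
        if rate >= self.down_at:
            return "outage"
        if rate >= self.degraded_at:
            return "degraded"
        return "ok"
```

A customer would call `record()` with the status code of each periodic probe request and raise an alert whenever `status()` changes. The supplier could publish the same coarse status on its own site.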


16 Responses to Crash: Amazon’s S3 utility goes down

  1. I am really surprised by the news.

    What the heck is going on? Amazon S3 suffered a massive outage? Massive? Really?

    I thought something like S3 would be at least distributed in such a manner that there won’t be a massive failure.

    It clearly affects many big web services. I think this will turn away many start-ups from S3. Now we know that they are not as reliable as previously ‘assumed’.

  2. Sergey Schetinin

    My company is an AWS-based solution provider, and let me tell you, AWS does have short interruptions in service from time to time, but this one is the biggest by far. It’s still completely down as far as I can see and has been for many hours. Their forum, which is the primary way of communicating with the AWS team, was read-only for a few hours as well.

    Amazon has an SLA, but you can only apply if you keep a more or less complete log of all requests to the service, which “coincidentally” is impossible if you use it for directly serving files for your website or otherwise don’t control the client side.

    Anyway, what I’m most concerned about at this moment is that Amazon could have lost some of the data as well. The scope of the outage suggests this could happen. Last September Amazon already wiped out a number of EC2 instances for no other reason than a bug in its management software (EC2 is still in beta though).

  3. Hopefully this is an aberration, but it does tell you that Amazon probably needs to build some redundancy into S3 as the service grows into something really important for the web.

  4. Sam

    It’s a major hurdle we *ALL* will have to clear on the way to the maturation of cloud computing.

    The party who solves the systematic risk & liability parts of the equation (better than the banks & monolines; and better than the well-known perpetual beta) will add much value to the cloud enterprise.

    Will it be Lloyd’s or someone like Goldman Sachs who has both the rocket science AND the common sense?

    Unless, of course, we systematically reduce our service expectations re uptime | redundancy | security | to whatever the cloud generally provides.

    Caveat emptor & save hardcopies.

  5. Was this across all data centers? I thought the selling point was that your data is balanced across all data centers, and if one goes down you’re fine.

    Ouch Amazon…..definitely a black eye.

  6. Nick, with salesforce.com earlier this week, Google, Six Apart and others before, we are holding SaaS and utility computing to an unreasonably high standard compared to rest of industry in uptime and quality…as I wrote below in Children of a lesser God? earlier this week…

    http://dealarchitect.typepad.com/deal_architect/2008/02/children-of-a-l.html

    the airline industry reports on-time stats and Southwest is a high standard but it is nowhere near 100%…

  7. RichM

    Kin: Some users are reporting that Amazon’s European data centers were unaffected. Amazon expanded its utility computing infrastructure into Europe late last year to accommodate regulatory guidelines for European users. No help to US users, though.

  8. “inevitable glitches”

    It’s not about the inevitable glitches of utility computing. It’s about infrastructure delivery maturity. AWS does not even have leveled support. Why would anyone build a production environment on a platform infrastructure that doesn’t have a support center? There are utility “cloud” providers that do have a support infrastructure.

    johnmwillis.com

  9. Nick Carr

    we are holding SaaS and utility computing to an unreasonably high standard compared to rest of industry in uptime and quality

    Holding them to a very high standard of service, whether reasonable or not, is probably the best thing we can do for these companies today. It’s only by delivering unusually good service that they’ll be able to convince a lot of companies to take the utility route. Unreasonable demands can be the mother of invention.

  10. It’s only by delivering unusually good service that they’ll be able to convince a lot of companies to take the utility route.

    I mostly agree, but I think that most of the discussions are missing the real point – whether you go cloud, in-house-classic, or some mix the final responsibility for reliability must be with the app owner.

    More here.

  11. Nick, it is good to be demanding. CIOs pay me to negotiate that with vendors. But not at the pricing amazon and sfdc and other Saas/Utility players are charging.

    If you take the average on-premise outsourcing contract with a CSC or it is not global, most need coverage 16×5 with 5-6 hours Saturday/Sunday, has plenty of downtime cushion built in which is not subject to the SLA etc…so on a global 24x7x365 grid, the AVERAGE would not even have 90% uptime. Of course they have global clients who want 24x7x365, but the average client demands far less.

    And what amazon charges for processing and storage, even after prorating the hosting and tuning that comes with SaaS, is a tiny fraction of what an EDS or ACS or IBM charges clients.

    So let’s not build Cristal tastes and expect to pay malt liquor pricing…

  12. BXL

    I agree with Mr. Carr, above. We SHOULD hold cloud and utility computing to a higher standard. When A.G. Bell first told Watson he needed him, he was lucky that the phone worked for a couple of seconds before it died. The world considered it a phenomenon. Not too long after, the world would be outraged over a single instance of lack of connectivity.

    Computing is no different and is quickly evolving to be an accepted, always on, utility, like the phone.

    So, we are saddened by this black eye. Amazon has paved the way for utility computing to evolve from a phenomenon to a commodity. It has provided great value with pretty good reliability, so to bash it too hard for this outage may be unfair.

    But, you can have your cake and eat it too. Hosting providers offering 3tera’s operating platform, AppLogic, can be relied on for robust, mission critical applications. AppLogic gives you many nines out of the box. It supports standard multi-tiered applications and relational databases. Check it out!

  13. pwb

    Two points:

    Amazon services will benefit from replacement services arising. The value of being able to take your business elsewhere should never be underestimated.

    It’s interesting that people are so much harsher on their third-party services than on their own. The average S3 user has their own servers go down much more often but goes ballistic when Amazon goes down.

  14. Does anyone know why the Amazon crash of 6/6/2008 didn’t crash their web services?

    It crashed their website. Does this mean their web services don’t flow through their web site?

    I heard they are using REST technology for web services, which typically flows through the web site’s HTTPS services.

  15. Hi,

    I just posted an article on ‘Cloud Availability’ at the following URL, would like to hear your comments on the same.

    http://mukulblog.blogspot.com/2008/07/cloud-availability.html

    Thanks,

    Mukul.