The original working title of my forthcoming book “The Big Switch” was “Dynamo,” so I was particularly interested to see that Amazon’s CTO, Werner Vogels, together with several of his colleagues, has published an extensive paper describing an innovative Amazon storage system called Dynamo. The system is used to support many of the most critical elements of Amazon’s operation, including shopping-cart processing. (Unlike another Amazon storage system, S3, Dynamo will not, says Vogels, be offered to the public as a utility.)
Most of the paper goes over my head, but I’m sure it will be of great interest to other engineers engaged in building the massive and massively reliable data-processing systems that will define the future of computing. Like Google, Amazon runs a vast online operation that requires, as the paper says, “an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously,” and reliability is maintained by software that is able to work around hardware failures seamlessly:
“Dealing with failures in an infrastructure comprised of millions of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such Amazon’s software systems need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance.”
One of those systems is Dynamo, which, as an alternative to rigid relational database systems, “has been the underlying storage technology for a number of the core services in Amazon’s e-commerce platform. It was able to scale to extreme peak loads efficiently without any downtime during the busy holiday shopping season. [For example,] Shopping Cart Service served tens of millions requests that resulted in well over 3 million checkouts in a single day …” Dynamo is designed to ensure that, typically, “99.9% of the read and write requests execute within 300ms” and that even higher tolerances can be achieved for particularly sensitive processes, thus guaranteeing the responsiveness and reliability that customers take for granted when shopping at Amazon’s site. It achieves such performance despite the fact that Amazon’s system is constructed of “standard commodity hardware components that have far less I/O throughput than high-end enterprise servers” and is characterized by huge swings in demand.
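To make the “99.9% within 300ms” target concrete, here is a minimal sketch of how such a check might be expressed. It is not taken from the paper: the function names, the sample data, and the choice to count a request that takes exactly 300ms as passing are all my own assumptions for illustration.

```python
def fraction_within(latencies_ms, threshold_ms=300.0):
    """Return the fraction of requests that completed within threshold_ms.

    Hypothetical helper; the name and defaults are illustrative only.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for t in latencies_ms if t <= threshold_ms)
    return within / len(latencies_ms)


def meets_target(latencies_ms, threshold_ms=300.0, target=0.999):
    """True if at least `target` of the requests finished within threshold_ms."""
    return fraction_within(latencies_ms, threshold_ms) >= target


# Made-up sample: 9,995 fast requests and 5 slow ones.
sample = [20.0] * 9995 + [450.0] * 5
print(fraction_within(sample))  # 0.9995
print(meets_target(sample))     # True
```

The point of stating the goal this way, rather than as an average, is that a handful of very slow requests can hide behind a healthy mean but cannot hide from a 99.9% threshold.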
At the start of the last century, the great engineering project was the creation of an electric grid that could deliver power to millions of users with a reliability and an efficiency that were previously unthinkable. Today’s great engineering project, of which Amazon’s Dynamo is but one manifestation, is to build a computing grid that can achieve similar breakthroughs in the processing and delivery of information.
Amazon’s paper, which will be presented this month at the ACM Symposium on Operating Systems Principles in Stevenson, Washington, is also available as a PDF.