In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.
A 50 percent chance that the cluster will overheat? That suggests Google's approach, which packs 40 servers into each rack, runs pretty close to the edge in terms of thermal management. Or perhaps Google has trouble anticipating when an area of its data center may develop cooling problems.
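To put Dean's rack and power figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the constant and function names are made up for illustration, and it assumes each rack failure and the single PDU failure are independent events whose impact simply adds up.

```python
# Figures as quoted from Dean's talk; ranges are (low, high).
RACK_FAILURES_PER_YEAR = 20
MACHINES_LOST_PER_RACK_FAILURE = (40, 80)
PDU_MACHINES_AFFECTED = (500, 1000)
PDU_OUTAGE_HOURS = 6  # "about 6 hours", treated here as exactly 6


def rack_machine_outages():
    """Total machine disappearances per year caused by rack failures."""
    low, high = MACHINES_LOST_PER_RACK_FAILURE
    return RACK_FAILURES_PER_YEAR * low, RACK_FAILURES_PER_YEAR * high


def pdu_machine_hours():
    """Machine-hours lost to the one power distribution unit failure."""
    low, high = PDU_MACHINES_AFFECTED
    return low * PDU_OUTAGE_HOURS, high * PDU_OUTAGE_HOURS


if __name__ == "__main__":
    print("Rack-failure machine outages:", rack_machine_outages())
    print("PDU machine-hours lost:", pdu_machine_hours())
```

By this estimate, rack failures alone account for 800 to 1,600 machine outages in the first year, and the single PDU failure costs 3,000 to 6,000 machine-hours.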
There are two ways to look at Amazon.com: as a retailer, and as a software company that runs a retailing application. Both are accurate, and in combination they explain why Amazon, rather than a traditional computer company, has become the most successful early mover in supplying computing as a utility service. For Amazon, running a cloud computing service is core to its business in a way that it isn't for, say, IBM, Sun, or HP.
Now, if you were a customer, whom would you buy web services from: Amazon or Google?