Everyone wants to survive a data center outage, but as the AWS outage showed, not everyone does. Here is a post that summarizes best practices in software architecture for surviving an outage like the AWS one.

Retrospect on recent AWS outage and Resilient Cloud-Based Architecture

Thursday, June 9, 2011 at 8:19AM

A bit over a month ago, Amazon experienced its infamous AWS outage in the US East Region. As a cloud evangelist, I was intrigued by how the outage unfolded. There were great posts during and after the outage from those who went down. But more interesting to me as an architect were the detailed posts from those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

The main principles, patterns and best practices are:

Design for failure

Stateless and autonomous services

Redundant hot copies spread across zones

Spread across several public cloud vendors and/or private cloud

Automation and monitoring

Avoiding ACID services and leveraging NoSQL solutions

Load balancing
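
To make a few of these principles concrete, here is a minimal sketch of how design for failure, redundant hot copies across zones, and load balancing can fit together. It is an illustration only, not how any of the companies above implemented it: the names `Replica` and `ZoneAwareBalancer`, the zone labels, and the in-process handlers standing in for network calls are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    """One hot copy of a stateless service, pinned to a zone (hypothetical)."""
    name: str
    zone: str
    handler: Callable[[str], str]  # stand-in for a real network call
    healthy: bool = True


class ZoneAwareBalancer:
    """Round-robin over replicas, skipping any that have already failed."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self._next = 0

    def call(self, request: str) -> str:
        # Try each replica at most once per request.
        for _ in range(len(self.replicas)):
            replica = self.replicas[self._next]
            self._next = (self._next + 1) % len(self.replicas)
            if not replica.healthy:
                continue
            try:
                return replica.handler(request)
            except ConnectionError:
                # Design for failure: treat the error as expected, mark the
                # replica down, and fall through to the next hot copy.
                replica.healthy = False
        raise RuntimeError("no healthy replica in any zone")


def ok(zone: str) -> Callable[[str], str]:
    """Build a handler that always succeeds and reports its zone."""
    return lambda request: f"{zone} handled {request}"


def down(request: str) -> str:
    """Handler that simulates an unreachable zone."""
    raise ConnectionError("zone unreachable")


balancer = ZoneAwareBalancer([
    Replica("app-1", "zone-a", down),  # simulates the failed zone
    Replica("app-2", "zone-b", ok("zone-b")),
    Replica("app-3", "zone-c", ok("zone-c")),
])

first = balancer.call("GET /")   # zone-a fails, zone-b answers
second = balancer.call("GET /")  # round-robin moves on to zone-c
print(first)
print(second)
```

Because the replicas hold no state, any healthy copy can answer any request, which is what lets the balancer route around the failed zone without the caller noticing. A real deployment would also need periodic health checks to bring recovered replicas back into rotation, which this sketch leaves out.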

If this seems daunting, new services are emerging that aim to handle scalability and availability for you.

The emerging solution to this complexity is a new class of application servers that offers to take care of the high availability and scalability concerns of your application, allowing you to focus on your business logic. Forrester calls these "Elastic Application Platforms", and defines them as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.