Amazon posted its post-mortem on the AWS outage a while ago. One entertaining way to read the Summary is to count how many times "automat*" gets used.
Here are a few examples.
For these database instances, customers with automatic backups turned on (the default setting) had the option to initiate point-in-time database restore operations.
RDS multi-AZ deployments provide redundancy by synchronously replicating data between two database replicas in different Availability Zones. In the event of a failure on the primary replica, RDS is designed to automatically detect the disruption and fail over to the secondary replica. Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required. We are actively working on a fix to resolve this issue.
So, AWS figured out there was a bug that left the monitoring agent unable to fail over automatically without risking data loss.
This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required. We are actively working on a fix to resolve this issue.
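The decision described in that passage can be sketched roughly as follows. This is not AWS's code, only a minimal illustration under stated assumptions: the names `Action`, `decide`, `primary_healthy`, and `secondary_in_sync` are hypothetical, and "safe to fail over" is reduced here to a single flag saying the secondary is known to hold all of the primary's data.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                    # primary is fine, nothing to do
    FAIL_OVER = "fail_over"          # promote the secondary automatically
    MANUAL = "manual_intervention"   # not safe to automate, call a human


def decide(primary_healthy: bool, secondary_in_sync: bool) -> Action:
    """Hypothetical monitoring-agent decision, not AWS's implementation."""
    if primary_healthy:
        return Action.NONE
    if secondary_in_sync:
        # Assumption: an in-sync secondary means no data is lost on promotion.
        return Action.FAIL_OVER
    # Primary is down and the secondary may be missing writes: automatic
    # failover would risk data loss, so the agent must stop and escalate.
    return Action.MANUAL


if __name__ == "__main__":
    print(decide(primary_healthy=False, secondary_in_sync=False).value)
```

The last branch is the case the post-mortem describes: the automation declines to act because acting could lose data.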
And they are going to fix the problem with more automation.
We will audit our change process and increase the automation to prevent this mistake from happening in the future.
Here are a few more areas where automat* is mentioned.
We’ll also continue to deliver additional services like S3, SimpleDB and multi-AZ RDS that perform multi-AZ level balancing automatically so customers can benefit from multiple Availability Zones without doing any of the heavy-lifting in their applications.
Speeding Up Recovery
We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster. We have a number of operational tools for managing an EBS cluster, but the fine-grained control and throttling the team used to recover the cluster will be built directly into the EBS nodes. We will also automate the recovery models that we used for the various types of volume recovery that we had to do. This would have saved us significant time in the recovery process.
With automat* mentioned so many times, it makes you think there is a lot of manual work going on in AWS.
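If you want to do the count yourself, here is a small sketch. It assumes you have pasted the post-mortem text into a string; the function name `count_automat` and the two-sentence sample (taken from the quotes above) are only for illustration.

```python
import re
from collections import Counter


def count_automat(text: str) -> Counter:
    """Count each word that starts with 'automat', ignoring case."""
    words = re.findall(r"\bautomat\w*", text, flags=re.IGNORECASE)
    return Counter(word.lower() for word in words)


if __name__ == "__main__":
    sample = (
        "RDS is designed to automatically detect the disruption and fail over "
        "to the secondary replica. We will audit our change process and "
        "increase the automation to prevent this mistake from happening in "
        "the future."
    )
    counts = count_automat(sample)
    # Total matches, then the breakdown by word form.
    print(sum(counts.values()), dict(counts))
```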
If you want an automated private cloud, you could learn from some of the original AWS EC2 team, who went on to start the company at http://nimbula.com/