Amazon’s Data Center Container "Perdix", something we haven’t seen before

Yesterday I went to Amazon’s Technology Open House.

[image]

Here is about a quarter of the crowd getting food and drinks before James Hamilton’s keynote.

[image]

James’s presentation has a section on Modular & Advanced Building Designs.

[image]

Every day, Amazon Web Services adds enough new capacity to support all of Amazon.com’s global infrastructure through the company’s first 5 years, when it was a $2.7 billion annual revenue enterprise.

[image]

James presents his latest observations on data center costs.

[image]

And waste in mechanical systems.

[image]

But here is something I didn’t expect: Amazon Perdix, Amazon’s version of a modular pre-fab container data center.  The picture below has Microsoft’s design on the left and Amazon’s on the right.

[image]

[image]

James is a believer in low density: 30 servers per rack, where the cost per server is $1,450 or less.
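For context, here is a quick back-of-the-envelope on what those numbers imply per rack. The $1,450 per server is from the slide; the per-server power draw and electricity rate below are my own assumptions, just to show the math.

```python
# Back-of-the-envelope rack economics for James's low-density numbers.
# The $1,450/server figure is from the talk; everything else here
# (power draw, utility rate) is my own assumption for illustration.

servers_per_rack = 30
cost_per_server = 1450          # USD, from the keynote
watts_per_server = 150          # assumed average draw, not from the talk
kwh_rate = 0.07                 # assumed $/kWh, not from the talk

capex_per_rack = servers_per_rack * cost_per_server
rack_power_kw = servers_per_rack * watts_per_server / 1000
annual_energy_cost = rack_power_kw * 24 * 365 * kwh_rate

print(f"Server capex per rack: ${capex_per_rack:,}")        # $43,500
print(f"Rack power draw:       {rack_power_kw:.1f} kW")      # 4.5 kW
print(f"Annual energy cost:    ${annual_energy_cost:,.0f}")  # ~$2,759
```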

James Hamilton Keynotes, Amazon Technology Open House - Jun 7, 2011

If you are in Seattle, you should try to go to Amazon's Technology Open House on June 7, 2011.

Amazon Technology Open House – June 7, 2011

Join us in one of our newest building locations on Amazon’s South Lake Union campus to hear from Amazon’s Distinguished Engineer, James Hamilton and network with teams from across the business including Amazon Web Services, Amazon Appstore for Android and Amazon Instant Video. Drinks and appetizers will be served and we look forward to welcoming you on campus. 

Sign Up for this Event
June 7, 2011
Amazon’s Campus in South Lake Union
345 Boren Avenue North
Seattle, WA 98109

Reasons to attend

  • Stay engaged in the local technology community and meet with like-minded individuals and companies who have helped maintain our community’s thriving technology culture
  • Learn more about recent engineering innovations at Amazon
  • Get an inside look at Amazon’s new South Lake Union headquarters

Who should attend:

  • Technology leaders, professionals and educators such as CIOs, CTOs, IT managers, consultants, SDEs, solution architects, administrators and professors of engineering and computer sciences

Failure Analysis ideas applied to Data Centers

James Hamilton has a post on what went wrong at the Fukushima Nuclear power plant.

What Went Wrong at Fukushima Dai-1

As a boater, there are times when I know our survival is 100% dependent upon the weather conditions, the boat, and the state of its equipment. As a consequence, I think hard about human or equipment failure modes and how to mitigate them. I love reading the excellent reporting by the UK Marine Accident Investigation Board. This publication covers human and equipment related failures on commercial shipping, fishing, and recreational boats. I read it carefully and I’ve learned considerably from it.

James explains how he connects his boating mindset to running IT services.

I treat my work in much the same way. At work, human life is not typically at risk but large service failures can be very damaging and require the same care to avoid. As a consequence, at work I also think hard about possible human or equipment failure modes and how to mitigate them.

In one of my first jobs, at HP, I worked in quality engineering, spent a lot of time in Palo Alto using their failure analysis facilities, and learned about ESD issues from Dick Moss.

[image]

Discussing Reliability Engineering and Data Centers is not common.  Running a search on "reliability engineer data center" turned up this job post at Google.

The role: Data Center Reliability and Maintenance Engineer

The Data Center Operations team designs and operates one of the largest and most sophisticated power and cooling systems in the world. You should have extensive experience being involved in the large-scale technical operations, and demonstrable problem-solving skills to lead the RCM program for the Data Center team with limited oversight. You should possess excellent communication skills, attention to detail, and the ability to create work process and procedures to enable the collection of highly accurate field operational data. You will have access to reliability data for one of the largest data center footprints globally and be expected to interact with other reliability and software engineers to holistically address the reliability issues and develop a program wide data acquisition system to continually increase reliability and PUE while lowering TCO.

Responsibilities:
  • Develop RCM (reliability centered maintenance) program in collaboration with multiple stakeholders.
  • Perform Reliability Engineering analysis based on field data collected on the critical systems and equipment through the use of proven industry techniques and principles such as RCA (root cause analysis) & FMEA (Failure Modes and Effects Analysis).
  • Present data based Reliability Predictions and Reliability Block Diagrams.
  • Collaborate on the selection of the critical equipment vendors based on past operational data on equipment failures.
  • Spearhead on all RCA effort through collaboration w/equipment vendors.
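The post mentions FMEA and Reliability Block Diagrams. As a simple illustration of the math behind a reliability block diagram (textbook series/parallel availability, nothing specific to Google), here is a sketch with made-up availability numbers for a utility feed, backup generator, and UPS.

```python
# Textbook reliability-block-diagram math; not Google's program.
# The availability figures below are made-up illustrative numbers.

def series(*availabilities):
    """All components must be up: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """Any one redundant component keeps the system up."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability

utility   = 0.9997   # assumed
generator = 0.995    # assumed
ups       = 0.9999   # assumed

# Utility and generator are redundant; the UPS is in series with them.
power_path = series(parallel(utility, generator), ups)
print(f"Power path availability: {power_path:.6f}")
print(f"Expected downtime: {(1 - power_path) * 8760:.1f} hours/year")
```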

Will Automation automatically fix Amazon's Outage issues? automat* mentioned 9 times in post mortem

A while ago, Amazon posted its post-mortem on the AWS outage.  One entertaining way to look at the Summary is to count the number of times "automat*" gets used.

Here are a few examples.

[image]

For these database instances, customers with automatic backups turned on (the default setting) had the option to initiate point-in-time database restore operations.

RDS multi-AZ deployments provide redundancy by synchronously replicating data between two database replicas in different Availability Zones. In the event of a failure on the primary replica, RDS is designed to automatically detect the disruption and fail over to the secondary replica. Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required. We are actively working on a fix to resolve this issue.

So, AWS figured out there was a bug that prevented the monitoring agent from safely failing over automatically.

This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required. We are actively working on a fix to resolve this issue.
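To make the failure mode concrete, here is a rough sketch of the decision an automated failover agent faces. This is not AWS's monitoring agent, just an illustration of why a network partition combined with "stuck" I/O is the hard case: if the agent cannot prove the secondary has every acknowledged write, promoting it risks data loss.

```python
# Illustrative only -- not AWS's RDS monitoring agent.
# The hard case from the post-mortem: the primary is unreachable AND its
# I/O is stuck, so the agent cannot confirm the secondary has every
# committed write. Failing over blindly risks data loss.

def decide_failover(primary_reachable: bool,
                    replication_in_sync: bool,
                    primary_io_stuck: bool) -> str:
    if primary_reachable and not primary_io_stuck:
        return "no action: primary healthy"
    if replication_in_sync:
        # Secondary provably has all committed writes -> safe to promote.
        return "automatic failover to secondary"
    # Cannot prove the secondary is current: the safe move is to stop
    # and ask a human rather than silently lose acknowledged writes.
    return "manual intervention required"

# The 2.5% of multi-AZ instances in the post-mortem fell into this case:
print(decide_failover(primary_reachable=False,
                      replication_in_sync=False,
                      primary_io_stuck=True))
```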

And, they are going to fix the problem with more automation.

[image]

We will audit our change process and increase the automation to prevent this mistake from happening in the future.

Here are a few more areas where automat* is mentioned.

[image]

We’ll also continue to deliver additional services like S3, SimpleDB and multi-AZ RDS that perform multi-AZ level balancing automatically so customers can benefit from multiple Availability Zones without doing any of the heavy-lifting in their applications.

Speeding Up Recovery

We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster. We have a number of operational tools for managing an EBS cluster, but the fine-grained control and throttling the team used to recover the cluster will be built directly into the EBS nodes. We will also automate the recovery models that we used for the various types of volume recovery that we had to do. This would have saved us significant time in the recovery process.
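The "fine-grained control and throttling" is the interesting part. Here is a generic sketch of what throttled recovery looks like (my illustration, not Amazon's EBS tooling): work through a backlog of volumes a few at a time so the re-mirroring traffic doesn't overwhelm the cluster it is trying to heal.

```python
# Generic sketch of throttled recovery -- not Amazon's EBS tooling.
# Recover a backlog of volumes a few at a time so re-mirroring traffic
# doesn't starve live customer I/O.

import time
from collections import deque

MAX_CONCURRENT = 4          # assumed throttle limit, tuned per cluster
volumes_to_recover = deque(f"vol-{i:04d}" for i in range(20))
in_flight = set()

def start_recovery(volume_id: str) -> None:
    print(f"re-mirroring {volume_id}")

def recovery_finished(volume_id: str) -> bool:
    return True             # stand-in for polling real recovery state

while volumes_to_recover or in_flight:
    # Drop volumes whose recovery has finished.
    in_flight = {v for v in in_flight if not recovery_finished(v)}
    # Top back up to the throttle limit.
    while volumes_to_recover and len(in_flight) < MAX_CONCURRENT:
        vol = volumes_to_recover.popleft()
        start_recovery(vol)
        in_flight.add(vol)
    time.sleep(0.1)          # poll interval
```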

With automat* mentioned so many times, it makes you think there is a lot of manual work going on in AWS.

If you want an automated private cloud, you could learn from some of the original AWS EC2 team, who went on to start Nimbula: http://nimbula.com/

[image]

The Media Data Center War - Apple started with Music, Amazon started with Books, who will win?

Engadget reports on the Amazon Android-powered tablet coming in the summer of 2011.

Amazon to take on Apple this summer with Samsung-built tablet?

By Thomas Ricker posted Apr 21st 2011 6:35AM

You really should pay attention when Engadget's founder, Peter Rojas speaks about the tech industry. Especially when he leads into a story like this:

It's something of an open secret that Amazon is working on an Android tablet and I am 99 percent certain they are having Samsung build one for them.

Which makes sense following the announcement of Amazon's Cloud Drive.

[image]

One data center guy I was talking to said the Apple guys aren't worried about Cloud Drive because Amazon will get sued.  I made the point that this makes sense if you are Apple, but not necessarily if you are Amazon.  Apple has had media companies like Apple Records battling the company for decades.  Amazon is looking to disrupt Apple's business models.  The huge margins Apple makes are opportunities for Bezos's crowd.

The next media battle will be fought in the cloud as well as on the devices, and the data centers are key to the strategy.  I place my bets on Amazon.  The media and loyal Mac users will be on Apple.  Amazon's thin retail margins have forced it to think efficiently, whereas Apple promotes "Think Different."

BTW, thinking efficiently usually aligns with using less energy, which is better for a green data center.