Failure Analysis Ideas Applied to Data Centers

James Hamilton has a post on what went wrong at the Fukushima nuclear power plant.

What Went Wrong at Fukushima Dai-1

As a boater, there are times when I know our survival is 100% dependent upon the weather conditions, the boat, and the state of its equipment. As a consequence, I think hard about human or equipment failure modes and how to mitigate them. I love reading the excellent reporting by the UK Marine Accident Investigation Board. This publication covers human and equipment related failures on commercial shipping, fishing, and recreational boats. I read it carefully and I’ve learned considerably from it.

James then explains how he applies his boating mindset to running IT services.

I treat my work in much the same way. At work, human life is not typically at risk but large service failures can be very damaging and require the same care to avoid. As a consequence, at work I also think hard about possible human or equipment failure modes and how to mitigate them.

In one of my first jobs, at HP, I worked in quality engineering. I spent a lot of time in Palo Alto using the company's failure analysis facilities and learned about ESD (electrostatic discharge) issues from Dick Moss.


Reliability engineering is not often discussed in the context of data centers. Running a search on "reliability engineer data center" turned up this job post at Google.

The role: Data Center Reliability and Maintenance Engineer

The Data Center Operations team designs and operates one of the largest and most sophisticated power and cooling systems in the world. You should have extensive experience being involved in the large-scale technical operations, and demonstrable problem-solving skills to lead the RCM program for the Data Center team with limited oversight. You should possess excellent communication skills, attention to detail, and the ability to create work process and procedures to enable the collection of highly accurate field operational data. You will have access to reliability data for one of the largest data center footprints globally and be expected to interact with other reliability and software engineers to holistically address the reliability issues and develop a program wide data acquisition system to continually increase reliability and PUE while lowering TCO.

  • Develop RCM (reliability centered maintenance) program in collaboration with multiple stakeholders.
  • Perform Reliability Engineering analysis based on field data collected on the critical systems and equipment through the use of proven industry techniques and principles such as RCA (root cause analysis) & FMEA (Failure Modes and Effects Analysis).
  • Present data based Reliability Predictions and Reliability Block Diagrams.
  • Collaborate on the selection of the critical equipment vendors based on past operational data on equipment failures.
  • Spearhead on all RCA effort through collaboration w/equipment vendors.
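
To make the "Reliability Block Diagrams" item a little more concrete, here is a minimal sketch of the arithmetic behind one. The block names and reliability values are hypothetical and chosen only for illustration; they do not come from the job post or from any real facility. The sketch also assumes that blocks fail independently, which real power and cooling systems often do not.

```python
def series(*reliabilities):
    """Blocks in series: the path works only if every block works."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """Redundant blocks: the group fails only if every block fails."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail


# Hypothetical probabilities that each block works over some period.
utility_feed = 0.99
backup_generator = 0.95
ups = 0.999

# Utility and generator are redundant sources; the UPS sits in series
# with whichever source is available.
power_source = parallel(utility_feed, backup_generator)
power_path = series(power_source, ups)

print(f"power source: {power_source:.4f}")
print(f"power path:   {power_path:.7f}")
```

In this toy model the single UPS in series limits the whole path: no amount of redundancy upstream can push the result above the reliability of that one block. That is the kind of observation a block diagram is meant to surface.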