Kfir Godrich discusses Data Center Commissioning role in delivering availability

I've had the pleasure of some great conversations with Kfir Godrich.  Kfir has a guest post on Compass Data Centers' blog that discusses data center commissioning.

Kfir starts with a subject that reminds me of my first summer jobs at HP, where I worked in quality engineering on warranty and reliability issues.

The data center commissioning (or Cx) journey starts with understanding the basics of reliability engineering contained in the IEEE Gold Book. First, we need to define the difference between reliability and availability. Availability is the probability that a system will work as required during the period of the mission, while reliability is the probability that the system will in fact maintain operations during the mission. The related terminology that helps us introduce the Cx is the data center predicted performance model. This model follows a failure mode typical of electronic equipment, also known as the “bathtub curve” (see Fig. 1).

Bathtub Curve

In Phase 1, also called the Infant Mortality Period, data centers go through a decreasing failure rate, and it is very desirable for this period to be as short as possible. This can be achieved by performing a full commissioning as described later. It is the author’s humble opinion that the level of commissioning must be proportional to the level of criticality and design Tier (per Uptime Institute) of the data center.

In Phase 2, referred to as the Random Failure Period, the failure rate is constant and well understood; MTBF (Mean Time Between Failures) is calculated during this phase. The desire here is to push that flat curve as low as possible. In Phase 3, the Wear-out Period, components begin to reach the end of their usable life. Replacing components proactively helps delay the ultimate upturn in the graph.
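Availability in the Gold Book sense falls out of two of the numbers above. Here is a minimal sketch; the MTBF and MTTR figures are purely illustrative, not from any real facility:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of mission time the system
    is operational, given MTBF (Mean Time Between Failures) and
    MTTR (Mean Time To Repair)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative: a system that fails on average once a year (~8,766 hours)
# and takes 4 hours to repair.
a = availability(mtbf_hours=8766, mttr_hours=4)
print(f"availability = {a:.5f}")  # about 99.95%
```

Note that this steady-state math assumes you are in Phase 2 of the bathtub curve, which is exactly why commissioning tries to compress Phase 1 before the facility goes into production.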

This post is the first in a series, so if you are interested in this topic there will be more.

Therefore, data center commissioning is about enabling the business through performance validation and functional testing of integrated platforms. This should typically be performed by an independent agent who is part of the customer's trusted advisory team, and it should be a core part of the overall project schedule. The cost for a commissioning agent can be in the range of 0.8-2% of the total budget. Since commissioning is essential for government facilities, the US Department of Energy publishes guidelines for commissioning scope and cost. Geographically, commissioning is more popular and comprehensive in North America and parts of Western Europe, while the rest of the world is becoming more familiar with these concepts. Our next blog will go a bit deeper into Integrated Testing—stay tuned. Till next time, Kfir
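Kfir's 0.8-2% rule of thumb translates directly into a budget line. A quick sketch; the $50M project budget is a made-up example, not a benchmark:

```python
def commissioning_cost_range(total_budget: float,
                             low_pct: float = 0.008,
                             high_pct: float = 0.02) -> tuple:
    """Commissioning-agent cost estimated as 0.8%-2% of the total budget."""
    return total_budget * low_pct, total_budget * high_pct

# Illustrative: a $50M data center build.
low, high = commissioning_cost_range(50_000_000)
print(f"${low:,.0f} to ${high:,.0f}")  # $400,000 to $1,000,000
```

A six- to seven-figure line item is why the commissioning scope is usually negotiated against the facility's criticality rather than applied uniformly.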

Kfir's new company is here.

7 things that are wrong with many Enterprise IT systems?

The Enterprise IT organization is an interesting entity.  The following are some observations I have made and are interesting problems to try and solve.

  1. The main priority of many people in Enterprise IT is to protect their jobs, due to the crappy way that many companies have treated their IT organizations.  It is a thankless job in many companies. In the past, efforts have been made to outsource the work to other companies and to India.  So, the people who are still around have developed a harsh survival instinct to do whatever it takes to protect their jobs.
  2. Too many nice people who try to do the right thing are the victims not the heroes.
  3. 80% or more of enterprise IT is staffed by people who are not technical by education. I have been spoiled working in product development at HP, Apple, and Microsoft, where you hire the best technical people to develop products.  These are what I consider technical staff.  They really know how things work and are so valuable that people will pay money for them.  Other than Amazon Web Services, which enterprise IT group has built an IT system so good that people would pay money for it?
  4. So, of the 20% of enterprise IT who are technical, can they make the really tough decisions?  Many times no, because decisions in most enterprise IT organizations are not made by the most technical people; they are made by the people who have the strongest survival instinct.
  5. The Cloud is a threat to the monopoly of enterprise IT.  Until the Cloud, users had to use the enterprise e-mail system, CRM, file servers, web hosting, etc.  Now the business units have choice.
  6. The private cloud's #1 goal in many companies is to shut down the choice of going outside the enterprise IT monopoly.
  7. The private cloud will be much more expensive than the public cloud, because the private cloud's goal is to protect jobs while the public cloud's goal is to reduce costs, which means higher utilization of all resources, including people.  Cloud environments have one admin per 1,000+ servers.  Many enterprises have one admin per 10-20 servers.  Some have moved to 100.  Few have achieved 1,000.
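The staffing gap in point 7 is easy to put in dollar terms. A rough sketch; the fully-loaded admin cost is a hypothetical assumption, but the ratios are the ones cited above:

```python
ADMIN_COST_PER_YEAR = 150_000  # hypothetical fully-loaded cost per admin

def admin_cost_per_server(servers_per_admin: int) -> float:
    """Annual admin labor cost attributed to one server at a given ratio."""
    return ADMIN_COST_PER_YEAR / servers_per_admin

# Enterprise (1:15), improved enterprise (1:100), cloud-scale (1:1000).
for ratio in (15, 100, 1000):
    print(f"1 admin per {ratio:>4} servers -> "
          f"${admin_cost_per_server(ratio):,.0f} per server per year")
```

Whatever admin cost you plug in, the labor cost per server drops by nearly two orders of magnitude between a typical enterprise ratio and a cloud-scale one.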

Huh, these things sound like they could be in a Dilbert cartoon.  They probably have been.

Here is today's Dilbert cartoon.


Analyst Roundtable: ARM in the data center

On Dec 19, 2012 at 10 a.m. there will be a webinar discussing the ARM chip in the data center.  The webinar is here.

Power matters: using ARM to reduce data center costs

A consumer using a computer to shop, email, search, or any of the other myriad tasks now possible, usually measures power consumption by the dollar figure on the monthly electrical bill. But as two recent, highly controversial articles in the New York Times reiterated, U.S. data centers backing up those tasks account for as much as two percent of the nation’s power consumption. Long before the articles appeared, data centers were aware of the problem and had begun employing various strategies to lower cooling costs, eliminate redundancies, and improve power usage effectiveness (PUE).

Another solution is deploying ultra-low power servers that reduce data center power needs – in a sense, reinventing the server. How does the efficiency of this solution stack up against the alternatives? What are some specific use cases? And, what’s the future for ARM-based server solutions? For answers to these and many other questions, join GigaOM Pro and our sponsor Calxeda for “Power matters: using ARM to reduce data center costs,” a free analyst roundtable webinar on Wednesday, December 19, 2012, at 10 a.m. PT.

I'll be on the webinar, along with Barry Evans from Calxeda.


Think Different, Infrastructure as an executive position, example Google's SVP Urs Hoelzle

3 years ago I was introduced to how differently Google thinks of the word infrastructure, when a Google guy I met said he worked on Google Infrastructure.  My context came from the dictionary definition.

Definition of INFRASTRUCTURE

1: the underlying foundation or basic framework (as of a system or organization)
2: the permanent installations required for military purposes
3: the system of public works of a country, state, or region; also: the resources (as personnel, buildings, or equipment) required for an activity

So, he worked in the data center group?   No, he worked on Google Infrastructure.  Search - the underlying foundation or basic framework of the company.  Cities are built on infrastructure, which is where we commonly get the use of the word.

The term typically refers to the technical structures that support a society, such as roads, bridges, water supply, sewers, electrical grids, telecommunications, and so forth, and can be defined as "the physical components of interrelated systems providing commodities and services essential to enable, sustain, or enhance societal living conditions."[3]

Google goes far in its use of Infrastructure to the point where Urs Hoelzle says he is an Infrastructure Czar.

Urs Hölzle, Google’s infrastructure czar tells us what the Cloud really is and what it is supposed to do.

It was nearly five years ago when I last spent time with Urs Hölzle, Google’s infrastructure czar. (His official title is SVP of operations.) It was around that time he introduced me (and several others) to many of the concepts (such as cloud and big data) that are now part of the technology sector’s vernacular. Hölzle was the company’s first VP of engineering, and he has led the development of Google’s technical infrastructure.

Hölzle’s current responsibilities include the design and operation of the servers, networks and data centers that power Google. It would be an understatement to say that he is amongst the folks who have shaped the modern web-infrastructure and cloud-related standards. When I had a chance to chat with him recently, my question was, “How do you define the cloud?”

...

Others might disagree, but Hölzle believes Google’s common infrastructure gives it a technological and financial edge over on-premise solutions. “We’re able to avoid some of that fragmentation and build on a common infrastructure,” says Hölzle. “That’s actually one of the big advantages of the cloud.”

Do you have an infrastructure czar or VP at your Web2.0, cloud company?  If not, you may have a hard time competing against those who have figured out how important infrastructure is.

Are you blind in the data center world? You can't see and remember everything, so at times, yes

If you asked an experienced data center person how many times a day they are blind to what is going on, they couldn't tell you. Why?  Because you are asking them to report what they didn't perceive: the number of times they missed seeing something. 

i-Perception has a post on research done to discover how often people fail to notice a fight happening right in front of them.


If you don't think this research applies, then you are probably of the mindset that you have a photographic memory and can remember all kinds of details.  But it is impossible to have a perfect photographic memory.

The Truth About Photographic Memory

When a professor studied eidetic imagery (photographic memory), he found that even eidetikers were not perfect.

Alan Searleman, a professor of psychology at St. Lawrence University in New York, says eidetic imagery comes closest to being photographic. When shown an unfamiliar image for 30 seconds, so-called "eidetikers" can vividly describe the image—for example, how many petals are on a flower in a garden scene. They report "seeing" the image, and their eyes appear to scan across the image as they describe it. Still, their reports sometimes contain errors, and their accuracy fades after just a few minutes. Says Searleman, "If they were truly 'photographic' in nature, you wouldn't expect any errors at all."

Now, you may think you are the exception, but consider this reason why we don't have photographic memory.  

Although psychologists don't know why children lose the ability, the loss of this skill may be functional: Were humans to remember every single image, it would be difficult to make it through the day.

And, even if you do have photographic memory, does everyone else in your data center team?

So, how many mistakes and errors in judgment are made because people are absolutely sure they saw something, or sure that something did not occur, when in fact they are wrong?

Being wrong is painful, and the reality is we are blind every day.  Yet how many systems, processes, and management decisions make the assumption that you see everything and remember all the details, that everyone has a perfect photographic memory?