Dynamic Power Usage Effectiveness (PUE) to real world dynamic energy efficiency

I've followed up on this entry with a post at /2008/01/dynamic-pue-rea.html, and I've appended that post to the end of this entry.

PUE = Total Facility Power / IT Equipment Power

I am about ready to write on the subject of energy efficiency measurements and how PUE should be calculated over time and under varying conditions, and I was fortunate to find a blog entry on this by Ken Oestreich.

Why? Because IT departments operate at greatly different levels; peak (maybe during the day) as well as off-peak (perhaps nights/weekends). Ideally, the data center should know how to adapt to these conditions: re-purposing "live" machines during peak hours; retiring and temporarily shutting-down idle servers during off-peak; removing power conditioning equipment when not needed; turning off specific CRAC units and chillers when not required (i.e. cold days and/or off-peak hours). We need an efficiency metric that indicates how data centers operate Dynamically.

Cornell Medical's Biomedicine compute servers shut down under these conditions, but they are not yet at the point of having control over their cooling systems.

My plan is to write about a customer who is dynamically measuring PUE under different operating conditions: as the outside temperature changes, as the room fills with servers over time, across different colo rooms within a facility, and in comparison with other facilities. Ken is right that PUE will change at various times. It is important to know the conditions and what the theoretical PUE should be under them; then you can determine whether your power and cooling systems are performing as expected.

I hope to have this article out by end of year.

Below is my continuation from /2008/01/dynamic-pue-rea.html

Dynamic PUE real world use

I've been meaning to write about PUE, and have been stumped in that it is defined as a metric, and the Green Grid document referenced makes no mention of it being dynamic. In reality PUE will be a dynamic # that changes as the load changes in a room. How ironic would it be if your best PUE # came when all the servers are running at near capacity, and shutting down servers to save power increased your PUE? Or if your energy efficient cooling system used large amounts of water in Southern California, where it is just a matter of time before water shortages cause more environmental issues?
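To make the irony concrete, here is a minimal sketch of how PUE can rise while total power falls. It assumes a deliberately simple facility model in which overhead has a fixed part (for example, cooling and power conditioning that run regardless of load) and a part proportional to IT load. The function names and every number below are hypothetical, chosen only for illustration, and do not come from any real facility.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """PUE = Total Facility Power / IT Equipment Power."""
    if it_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_kw


def facility_power(it_kw: float,
                   fixed_overhead_kw: float = 200.0,
                   proportional_overhead: float = 0.3) -> float:
    """Hypothetical model: total power = IT load + fixed overhead
    + overhead that scales with IT load. Real facilities are messier."""
    return it_kw + fixed_overhead_kw + proportional_overhead * it_kw


# Compare a peak load with an off-peak load where half the servers are off.
for label, it_kw in [("peak", 1000.0), ("off-peak", 500.0)]:
    total_kw = facility_power(it_kw)
    print(f"{label}: IT {it_kw:.0f} kW, total {total_kw:.0f} kW, "
          f"PUE {pue(total_kw, it_kw):.2f}")
```

Under these assumed numbers, shutting down half the servers cuts total power from 1500 kW to 850 kW, yet PUE goes from 1.50 to 1.70, because the fixed overhead is now spread over a smaller IT load. A single static PUE # hides that.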

What helped me to think of PUE as a dynamic # was to think of it as a quality control metric. The quality of the electrical and mechanical systems and their operation over time are inputs into PUE. As load changes and servers are turned off, the variability of the power and cooling systems influences your PUE. So, PUE can now have a statistical range of operation given the conditions. This sounds familiar. It's statistical process control.

Statistical Process Control (SPC) is an effective method of monitoring a process through the use of control charts. Much of its power lies in the ability to monitor both process centre and its variation about that centre. By collecting data from samples at various points within the process, variations in the process that may affect the quality of the end product or service can be detected and corrected, thus reducing waste as well as the likelihood that problems will be passed on to the customer. With its emphasis on early detection and prevention of problems, SPC has a distinct advantage over quality methods, such as inspection, that apply resources to detecting and correcting problems in the end product or service.

For example, a breakfast cereal packaging line may be designed to fill each cereal box with 500 grams of product, but some boxes will have slightly more than 500 grams, and some will have slightly less, in accordance with a distribution of net weights. If the production process, its inputs, or its environment changes (for example, the machines doing the manufacture begin to wear) this distribution can change. For example, as its cams and pulleys wear out, the cereal filling machine may start putting more cereal into each box than specified. If this change is allowed to continue unchecked, more and more product will be produced that falls outside the tolerances of the manufacturer or consumer, resulting in waste. While in this case, the waste is in the form of "free" product for the consumer, typically waste consists of rework or scrap.

By observing at the right time what happened in the process that led to a change, the quality engineer or any member of the team responsible for the production line can troubleshoot the root cause of the variation that has crept into the process and correct the problem.

This last point, observing at the right time what happened in the process that led to a change, is ultimately what needs to be achieved with a dynamic PUE system. Without a system and mindset like this, you wouldn't know how to fix PUE problems, which is what I think is wrong with a static PUE mindset. You need closed loop feedback to monitor PUE and see if it is performing as expected given the operating conditions and load.
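As a rough illustration of what that feedback could look like, here is a minimal sketch of an individuals control chart applied to PUE readings. It is my own simplification, not how any particular monitoring product works. The limits are the centre line plus or minus 2.66 times the average moving range, which is the standard constant for an individuals chart. The function names and the PUE readings are hypothetical, and the baseline is assumed to have been collected under comparable conditions (similar load and outside temperature).

```python
from statistics import mean


def control_limits(baseline):
    """Return (lower, centre, upper) limits for an individuals chart.

    Uses the average moving range between consecutive baseline readings;
    2.66 is the usual constant (3 / 1.128) for moving ranges of size 2.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline readings")
    centre = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    half_width = 2.66 * mean(moving_ranges)
    return centre - half_width, centre, centre + half_width


def out_of_control(samples, lower, upper):
    """Return (index, value) pairs for readings outside the limits."""
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical PUE readings taken under similar load and weather.
baseline_pue = [1.50, 1.52, 1.49, 1.51, 1.50, 1.48, 1.51, 1.50]
lower, centre, upper = control_limits(baseline_pue)

# New readings to check against the baseline limits.
new_pue = [1.51, 1.49, 1.62, 1.50]
for index, value in out_of_control(new_pue, lower, upper):
    print(f"reading {index}: PUE {value:.2f} is outside "
          f"[{lower:.3f}, {upper:.3f}], investigate power and cooling")
```

With these made-up readings, only the 1.62 sample falls outside the limits, and that is the moment to go look at what changed in the power and cooling systems. In practice you would likely keep a separate baseline for each operating condition, since the expected PUE at peak load on a cold day is not the expected PUE on an off-peak summer night.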

Note: the point about breakfast cereal reminds me of Microsoft's Mike Manos, Sr. Director Data Center Services, and his first job working in Rice-A-Roni operations, learning process control, which is probably why he has invested in software from OSIsoft to help monitor PUE. Cornell uses the same software as well. For more details see Microsoft's Jeff O'Reilly presentation or Cornell's Jason Banfelder presentation.