Google Implements Software Defined PUE, 99.6% Accurate DC Performance Modeling

Google has posted a paper on its Machine Learning Application for Data Center Optimization, along with a blog post.

PUE is a topic that Google knows well.  I was one of the first to interview Urs Hoelzle on Google’s PUE back in Oct 2008.  Over the years Google has shared more and more about its data centers, and its PUE has continued to get better, but there are limits to what people can do.

After a while you reach a point of diminishing returns, and people get tired and frustrated chasing that next 0.01 improvement in PUE.
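
To put that 0.01 in perspective, here is a rough sketch of the arithmetic. PUE is total facility power divided by IT equipment power, so everything above 1.0 is overhead. The 10 MW IT load and the 1.12 and 1.11 PUE values are hypothetical numbers picked for illustration, not Google's figures.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """PUE = total facility power / IT equipment power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def overhead_kw(pue_value: float, it_load_kw: float) -> float:
    """Power spent on everything except IT equipment (cooling, power distribution, etc.)."""
    return (pue_value - 1.0) * it_load_kw


IT_LOAD_KW = 10_000.0  # hypothetical 10 MW of IT load

before = overhead_kw(1.12, IT_LOAD_KW)   # hypothetical starting PUE
after = overhead_kw(1.11, IT_LOAD_KW)    # after a 0.01 improvement
saved_kw = before - after                # overhead power no longer needed
share_of_overhead = saved_kw / before    # fraction of the overhead that was removed
```

With these made-up numbers, a 0.01 improvement removes roughly 100 kW, which is about 8% of the overhead that was left. The closer PUE gets to 1.0, the larger the share of the remaining overhead each 0.01 represents, which is one way to see why the last steps are so hard.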

It seems obvious to use computers to run simulations of operations.  Many have tried to build models, but the approach has not been wildly successful because data center cooling systems are complex and not easy to model.

Simulation is the imitation of the operation of a real-world process or system over time.[1] The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors/functions of the selected physical or abstract system or process. 

Another approach is to discover the model from operations data and let the data define the model; this is what machine learning does.  A warning: this approach, which is also used in handwriting recognition and OCR, requires training and testing to confirm the model is accurate.  Luckily Google, with all its years of tracking PUE, has the data to train a model, and a mechanical engineer who had the vision to tackle this problem.
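
A minimal sketch of the train-then-test idea, under heavy assumptions: the data is synthetic, the only input is a hypothetical IT load fraction, and the model is a simple least-squares line rather than whatever Google actually built. The point is only to show fitting on one slice of the data and checking the error on a slice the model has never seen.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def synthetic_sample():
    """Return (it_load_fraction, pue) from an invented linear relationship plus noise."""
    load = rng.uniform(0.3, 1.0)
    pue = 1.25 - 0.12 * load + rng.gauss(0.0, 0.005)
    return load, pue


# Stand-in for years of logged operations data.
data = [synthetic_sample() for _ in range(1000)]
rng.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]  # 70% to fit, 30% held out


def fit_line(samples):
    """Closed-form least squares for pue ~ intercept + slope * load."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(train)  # the model only ever sees the training slice


def predict(load):
    """Predicted PUE for a given IT load fraction."""
    return intercept + slope * load


# Mean absolute error on the held-out data is the honest check of the model.
test_mae = sum(abs(predict(x) - y) for x, y in test) / len(test)
```

If the error on the held-out slice is much worse than on the training slice, the model has memorized the data instead of learning the relationship, which is why the testing step matters.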

Jim Gao, an engineer on our data center team, is well-acquainted with the operational data we gather daily in the course of running our data centers. We calculate PUE, a measure of energy efficiency, every 30 seconds, and we’re constantly tracking things like total IT load (the amount of energy our servers and networking equipment are using at any time), outside air temperature (which affects how our cooling towers work) and the levels at which we set our mechanical and cooling equipment. Being a smart guy—our affectionate nickname for him is “Boy Genius”—Jim realized that we could be doing more with this data. He studied up on machine learning and started building models to predict—and improve—data center performance.  

After some trial and error, Jim’s models are now 99.6 percent accurate in predicting PUE. This means he can use the models to come up with new ways to squeeze more efficiency out of our operations. For example, a couple months ago we had to take some servers offline for a few days—which would normally make that data center less energy efficient. But we were able to use Jim’s models to change our cooling setup temporarily—reducing the impact of the change on our PUE for that time period.

Why do this?

  1. A machine learning approach leverages the plethora of existing sensor data to develop a mathematical model that understands the relationships between operational parameters and the holistic energy efficiency. 
  2. This type of simulation allows operators to virtualize the DC for the purpose of identifying optimal plant configurations while reducing the uncertainty surrounding plant changes.
  3. Model applications include DC simulation to evaluate new plant configurations, assessing energy efficiency performance, and identifying optimization opportunities.
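
Here is a rough sketch of what "virtualizing" the plant could look like once a predictive model exists: score every candidate configuration with the model and pick the one with the lowest predicted PUE. Everything in it is hypothetical. `predicted_pue` is a stand-in with invented coefficients, and the two knobs (towers running, supply temperature) are simply examples of settings an operator might vary.

```python
from itertools import product


def predicted_pue(it_load_kw, towers_running, supply_temp_c):
    """Stand-in for a trained model; the coefficients are invented for illustration."""
    cooling_needed = it_load_kw / 2500.0             # made-up "ideal" tower count
    mismatch = abs(towers_running - cooling_needed)  # too few or too many towers
    return 1.08 + 0.01 * mismatch + 0.002 * max(0.0, 24.0 - supply_temp_c)


def best_configuration(it_load_kw, tower_options, temp_options):
    """Score every candidate configuration and return the lowest-PUE one."""
    candidates = product(tower_options, temp_options)
    return min(candidates, key=lambda c: predicted_pue(it_load_kw, *c))


towers = range(1, 7)                 # hypothetical: 1 to 6 cooling towers
temps = [18.0, 20.0, 22.0, 24.0]     # hypothetical supply temperatures in Celsius

normal = best_configuration(10_000.0, towers, temps)
reduced = best_configuration(5_000.0, towers, temps)  # e.g. some servers offline
```

The search runs against the model rather than the real plant, so a bad candidate costs nothing to evaluate. An exhaustive search like this only works for a handful of knobs; a real plant with many settings would need something smarter.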

Results:  Google predicts PUE with 99.6% accuracy.  Google has successfully modeled its mechanical systems using machine learning.  Google has implemented a Software Defined PUE, which allows it to predict PUE as systems and load change.  Who doesn’t want this capability?  Almost everyone who has bought DCIM thought they would get this.
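
The posts quoted here do not spell out how the 99.6% figure is computed. One plausible reading, and it is only an assumption, is 100% minus the mean absolute error taken relative to the actual PUE. A small sketch with invented numbers:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between measured and predicted PUE."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def accuracy_percent(actual, predicted):
    """100 * (1 - mean absolute relative error); an assumed reading of 'accuracy'."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    rel = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100.0 * (1.0 - sum(rel) / len(rel))


# Invented measurements and predictions, not data from Google.
actual = [1.10, 1.12, 1.09, 1.11]
predicted = [1.104, 1.116, 1.094, 1.106]

mae = mean_absolute_error(actual, predicted)
acc = accuracy_percent(actual, predicted)
```

Under this reading, being off by about 0.004 on a PUE near 1.1 works out to an accuracy a little above 99.6%.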

There are many other ideas that Google has put together and I plan on writing more posts on what they have shared.  This is just the beginning of applying machine learning and neural networks to data center operations.  There are many other complex interactions that can be modeled.