Does Google's Data Center Machine Learning Model have a debug mode? It should

I threw up two posts (1st post and 2nd post) on Google’s use of machine learning in the data center and said I would write more.  Well, here is another one.

Does Google’s Data Center Machine Learning Model have a debug mode?  The current system describes the use of data collected every 5 minutes over about 2 years.

 184,435 time samples at 5 minute resolution (approximately 2 years of operational data)

One method almost no one uses is debugging their mechanical systems as if they were debugging software.

Debugging is a methodical process of finding and reducing the number of bugs, or defects, in a computer program or a piece of electronic hardware, thus making it behave as expected. Debugging tends to be harder when various subsystems are tightly coupled, as changes in one may cause bugs to emerge in another.

What would a debug mode look like in DCMLM (my own acronym for Data Center Machine Learning Model)?  You are seeing performance that suggests a subsystem is not behaving as expected.  Change the sampling rate from 5 minutes to 1 second.  Hopefully the controller will function correctly at the higher sample rate; the controller may work fine, but the transport bus may not.  With 1 second fidelity, make changes to settings and collect data.  Repeat the changes.  Compare results.  Create other stress cases.

What will you see?  From the time you make a change to a setting, how long does it take to reach the desired state?  At 5 minute sampling you cannot see the transition and the possible delays.  Was the transition smooth or a step function?  Was there an overshoot in value and then corrections?
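To make the idea concrete, here is a minimal sketch of what that analysis could look like on 1-second data. Everything here is illustrative and assumed, not part of Google's system: the function name, the 2% settling band, and the sample trace are all mine.

```python
# Hypothetical sketch: analyzing a 1-second-resolution response after a
# setpoint change. Names, thresholds, and data are illustrative assumptions.

def analyze_transition(samples, setpoint, band=0.02):
    """Return (overshoot, settling_index) for a step response.

    samples: values recorded at 1-second intervals after the change.
    setpoint: the desired steady-state value.
    band: settling tolerance as a fraction of the setpoint.
    """
    overshoot = max(samples) - setpoint
    tol = abs(setpoint) * band
    settling_index = None
    for i in range(len(samples)):
        # Settled once every later sample stays inside the tolerance band.
        if all(abs(s - setpoint) <= tol for s in samples[i:]):
            settling_index = i  # seconds after the change, at 1 Hz sampling
            break
    return overshoot, settling_index

# Example: a response that overshoots to 21.4 before settling at 20.0
trace = [18.0, 19.5, 21.4, 20.6, 20.1, 20.0, 20.0]
print(analyze_transition(trace, 20.0))
```

At 5 minute sampling, that entire overshoot-and-settle sequence would collapse into a single data point; at 1 second you can see it and tune against it.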

The controllers have code running in them, sensors go bad, and wiring connections are intermittent.  How do you find these problems?  Being able to go into a debug mode could be useful.

If Google were able to compare the detailed operations of two different installations of the same mechanical system, it could find whether a problem is unique to a site.  Or it could simply compare the same system at different points in time.

Google's Machine Learning Application is a Tool for AI, but not AI

It is not AI; it is machine learning, a tool that supports a mathematical model of a data center's mechanical systems.


If you Google Image Search “artificial intelligence” you see these images.


This is not what Google’s data center group has built with an application of machine learning.  When you Google Image Search “neural network” you see this.




Google’s method of improving the efficiency of its data centers by optimizing for cost is a machine learning application, not, as covered in the media, an artificial intelligence system. With "artificial intelligence" it is easy for many to assume the system thinks.  Google’s machine learning model takes 19 inputs, then produces a predicted PUE with 99.6% accuracy and the settings to achieve that PUE.


Problems to be solved

  1. The interactions between DC Mechanical systems and various feedback loops make it difficult to accurately predict DC efficiency using traditional engineering formulas.
  2. Using standard formulas for predictive modeling often produces large errors because they fail to capture such complex interdependencies.
  3. Testing each and every feature combination to maximize efficiency would be unfeasible given time constraints, frequent fluctuations in the IT load and weather conditions, as well as the need to maintain a stable DC environment.

These problems describe the difficulty of building a mathematical model of the system.


Why Neural Networks? 

To address these problems, a neural network is selected as the mathematical framework for training DC energy efficiency models. Neural networks are a class of machine learning algorithms that mimic cognitive behavior via interactions between artificial neurons [6]. They are advantageous for modeling intricate systems because they do not require the user to predefine the feature interactions in the model, which would assume fixed relationships within the data. Instead, the neural network searches for patterns and interactions between features to automatically generate a best fit model.


There are 19 different factors that are inputs to the neural network:

1. Total server IT load [kW]

2. Total Campus Core Network Room (CCNR) IT load [kW]

3. Total number of process water pumps (PWP) running

4. Mean PWP variable frequency drive (VFD) speed [%]

5. Total number of condenser water pumps (CWP) running

6. Mean CWP variable frequency drive (VFD) speed [%]

7. Total number of cooling towers running

8. Mean cooling tower leaving water temperature (LWT) setpoint [F]

9. Total number of chillers running

10. Total number of drycoolers running

11. Total number of chilled water injection pumps running

12. Mean chilled water injection pump setpoint temperature [F]

13. Mean heat exchanger approach temperature [F]

14. Outside air wet bulb (WB) temperature [F]

15. Outside air dry bulb (DB) temperature [F]

16. Outside air enthalpy [kJ/kg]

17. Outside air relative humidity (RH) [%]

18. Outdoor wind speed [mph]

19. Outdoor wind direction [deg]


There are five hidden layers with 50 nodes per layer.  The hidden layers are the blue circles in the diagram below.  The red circles are the 19 different inputs.  The yellow circle is the output, the predicted PUE.
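The 19-input, five-hidden-layer, one-output shape described above can be sketched in a few lines. This is only a structural illustration with random weights; the activation choice (sigmoid) and all names here are my assumptions, and the real model's parameters come from training on the operational data.

```python
import numpy as np

# Structural sketch of the described architecture: 19 inputs, five hidden
# layers of 50 nodes each, one output (predicted PUE). Weights are random
# placeholders; a trained model would supply real parameters.
rng = np.random.default_rng(0)
layer_sizes = [19] + [50] * 5 + [1]
weights = [rng.normal(0.0, 0.1, (m, n))
           for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def predict_pue(x):
    """Forward pass: sigmoid hidden layers, linear output node."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = 1.0 / (1.0 + np.exp(-(a @ W + b)))  # sigmoid activation
    return a @ weights[-1] + biases[-1]          # linear output

x = rng.random(19)        # one vector of the 19 (normalized) inputs
print(predict_pue(x))     # a single predicted PUE value
```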


Multiple iterations are run to reduce cost.  The cost function is below.
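The cost-function figure did not survive into this post. For a regression network like this, the standard choice is the regularized squared-error cost; treat the exact form below as a reconstruction rather than a verbatim copy of the paper's equation.

```latex
% m = number of training examples, h_\theta(x^{(i)}) = predicted PUE for
% example i, y^{(i)} = measured PUE, \lambda = regularization strength.
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2}
          + \frac{\lambda}{2m}\sum_{j}\theta_{j}^{2}
```

Training iterates to drive this cost down: the first term penalizes prediction error against the measured PUE, and the second discourages overly large weights that would overfit the historical data.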




A machine learning approach leverages the plethora of existing sensor data to develop a mathematical model that understands the relationships between operational parameters and the holistic energy efficiency. This type of simulation allows operators to virtualize the DC for the purpose of identifying optimal plant configurations while reducing the uncertainty surrounding plant changes.




Note: I have used Jim Gao’s document with some small edits to create this post.


Google Implements Software Defined PUE, 99.6% Accurate DC Performance Modeling

Google has posted a paper on its Machine Learning Application for Data Center Optimization and a blog post.

PUE is a topic that Google knows well.  I was one of the first to interview Urs Hoelzle on Google’s PUE back in Oct 2008.  Over the years Google has shared more and more about its data centers, and their PUE has continued to get better, but there are limits to what people can do.


After a while you reach a level of diminishing returns, and people get tired and frustrated chasing that 0.01 improvement in PUE.

It seems obvious to use computers and run simulations of operations.  Many have tried to build models, but this approach has not been wildly successful because the complexity of data center cooling systems is not easy to model.

Simulation is the imitation of the operation of a real-world process or system over time.[1] The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors/functions of the selected physical or abstract system or process. 

Another approach is to discover the model from operations data and let the data define the model; a machine learning model works this way.  A warning: this approach, which is also used in handwriting recognition and OCR, requires training and testing to confirm the model is accurate.  Luckily, Google, with all its years of tracking PUE, has the data to train a model, and a mechanical engineer who had the vision to tackle this problem.
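The train-and-test discipline mentioned above is simple to sketch: hold back part of the historical data so accuracy is measured on samples the model never saw. The data and the trivial "model" below are placeholders of my own invention, not Google's.

```python
import random

# Stand-in historical records: (feature_id, measured_PUE) pairs.
samples = [(i, 1.10 + 0.001 * (i % 7)) for i in range(1000)]
random.seed(42)
random.shuffle(samples)

split = int(0.7 * len(samples))            # e.g. 70% train, 30% test
train, test = samples[:split], samples[split:]

# "Train" a trivial baseline model (the mean PUE of the training set),
# then report error only on the held-out test set. A model judged on its
# own training data will look deceptively accurate.
mean_pue = sum(y for _, y in train) / len(train)
test_error = sum(abs(y - mean_pue) for _, y in test) / len(test)
print(round(test_error, 4))
```

The claimed 99.6% accuracy only means something because it was measured this way, against data held out from training.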

Jim Gao, an engineer on our data center team, is well-acquainted with the operational data we gather daily in the course of running our data centers. We calculate PUE, a measure of energy efficiency, every 30 seconds, and we’re constantly tracking things like total IT load (the amount of energy our servers and networking equipment are using at any time), outside air temperature (which affects how our cooling towers work) and the levels at which we set our mechanical and cooling equipment. Being a smart guy—our affectionate nickname for him is “Boy Genius”—Jim realized that we could be doing more with this data. He studied up on machine learning and started building models to predict—and improve—data center performance.  

After some trial and error, Jim’s models are now 99.6 percent accurate in predicting PUE. This means he can use the models to come up with new ways to squeeze more efficiency out of our operations. For example, a couple months ago we had to take some servers offline for a few days—which would normally make that data center less energy efficient. But we were able to use Jim’s models to change our cooling setup temporarily—reducing the impact of the change on our PUE for that time period.

Why do this?

  1. A machine learning approach leverages the plethora of existing sensor data to develop a mathematical model that understands the relationships between operational parameters and the holistic energy efficiency. 
  2. This type of simulation allows operators to virtualize the DC for the purpose of identifying optimal plant configurations while reducing the uncertainty surrounding plant changes.
  3. Model applications include DC simulation to evaluate new plant configurations, assessing energy efficiency performance, and identifying optimization opportunities.

Results:  Google predicts PUE with 99.6% accuracy.  Google has successfully modeled its mechanical systems using machine learning.  Google has implemented a Software Defined PUE, which allows them to predict PUE as systems and loads change.  Who doesn’t want this capability?  Almost everyone who has bought DCIM thought they would get this.


There are many other ideas that Google has put together and I plan on writing more posts on what they have shared.  This is just the beginning of applying machine learning and neural networks to data center operations.  There are many other complex interactions that can be modeled.

Google's VP of Data Centers Joe Kava shares Best Practices

Here is a collection of ideas Google’s VP of Data Centers Joe Kava has shared recently.

Some Tweets from Joe’s Keynote at Uptime.

  1. Joe Kava at : Data center build and maintain contracts are important but relationships you foster are more important.

  2. At : 's Joe Kava on unified IT/DC: "We consider it a manufacturing process. We manufacture data processing"

  3. Joe Kava, VP of Data Centers, Google: bring DC Ops team on site 6 months before go-live of new build so they can learn the site.

Cover story from Facilities Net.



As Google Grows, It's Up To Joe Kava To Ensure Data Centers Keep Pace

By Casey Laughman, Managing Editor - May 2014 - Data Centers





Whether it's directions, email, or figuring out who played Goon No. 2 in that old movie the other night, the odds are pretty good that you have used Google recently. After all, a company name becoming a verb is a pretty good indication that it's become the de facto standard for its industry. In Google's industry, that kind of use demands some heavy-duty support from the data center. It's up to Joe Kava, vice president of data centers, to make sure that support is delivered in all aspects of the company's data centers, from design to operations.

Considering the growth spurt the company has been on, it isn't an easy task. In 2006, Google brought online its first owned and operated data center, a $1.2 billion facility in The Dalles, Ore. Since then, the company has brought or will soon be bringing online 11 more data centers spread across six countries and four continents. As the person responsible for Google's data centers since joining the company in 2008, Kava has been in charge of most of those projects.

And Joe has another presentation coming up on May 28th.

Joe Kava Masterclass
Beyond the PUE Plateau

Presented by Joe Kava, Vice President Data Center, Google

Lean At Amazon, the future is not where you would think, but good for data center geeks

It is rare to get a document written about Amazon’s processes.  McKinsey has a post on an interview with Marc Onetto.  Who is Marc?

Marc Onetto biography

Vital Statistics

Born September 3, 1950, in Paris, France

Graduated with an MS in engineering in 1973 from École Centrale de Lyon and with an MBA in industrial administration in 1975 from Carnegie Mellon University’s Tepper School of Business
Career Highlights (2006–13)

  • Senior vice president of worldwide operations and customer service

The most surprising part was what Marc says is the next frontier: lean-management principles applied to software creation.

Next frontiers

Perhaps the biggest challenge I see is the application of lean-management principles to software creation, which is highly complex, with numerous opportunities for defects. Software engineers have not yet been able to stop the line and detect defects in real time during development. The only real testing happens once the software is completed, with the customer as a beta tester. To me, this is unacceptable; we would never do that with a washing machine. We would not ask customers to tell us when the washer leaks or what’s wrong with it once it has arrived at their homes. I’ve tried to address the problem, and some of Amazon’s computer-science engineers have looked at it, but it is still one of the biggest challenges for lean.