Does Google's Data Center Machine Learning Model have a debug mode? It should

I threw two posts (1st post and 2nd post) up on Google’s use of Machine Learning in the Data Center and said I would write more.  Well, here is another one.

Does Google’s Data Center Machine Learning Model have a debug mode?  The current system describes the use of data collected every 5 minutes over about 2 years.

184,435 time samples at 5 minute resolution (approximately 2 years of operational data)

One method almost no one uses is debugging their mechanical systems as if they were debugging software.

Debugging is a methodical process of finding and reducing the number of bugs, or defects, in a computer program or a piece of electronic hardware, thus making it behave as expected. Debugging tends to be harder when various subsystems are tightly coupled, as changes in one may cause bugs to emerge in another.

What would debug mode look like in DCMLM (my own acronym for Data Center Machine Learning Model)?  You are seeing performance that suggests a subsystem is not behaving as expected.  Change the sampling rate to 1 second.  Hopefully the controller will function correctly at the higher sample rate; the controller may work fine, but the transport bus may not.  With 1 second fidelity, make changes to settings and collect data.  Repeat the changes.  Compare results.  Create other stress cases.

What will you see?  From the time you make a change to a setting, how long does it take to reach the desired state?  At 5 minute sampling you cannot see the transition and the possible delays.  Was the transition smooth or a step function?  Was there an overshoot in value and then corrections?
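To make the debug-mode idea concrete, here is a minimal sketch of what you could compute from a 1-second trace after a setpoint change: how long the transition took to settle within a tolerance band, and how far it overshot.  The function and variable names are my own invention, not anything from Google's system.

```python
import math

def step_response_stats(times, values, setpoint, tol=0.02):
    """Summarize a controller transition captured at 1-second resolution.

    times: sample times in seconds; values: the measured variable
    (e.g. supply water temperature); setpoint: the newly commanded target.
    Returns (settling_time_s, overshoot_pct).
    """
    band = tol * abs(setpoint)  # +/-2% settling band around the setpoint
    # The last sample outside the band marks the end of the transition.
    outside = [i for i, v in enumerate(values) if abs(v - setpoint) > band]
    settled_at = outside[-1] + 1 if outside else 0
    settling_time = times[settled_at] - times[0] if settled_at < len(times) else None
    # Overshoot: how far the response swings past the setpoint.
    overshoot_pct = max(0.0, (max(values) - setpoint) / abs(setpoint) * 100)
    return settling_time, overshoot_pct

# Synthetic 1 Hz trace: a damped, oscillating approach to a 65 F setpoint.
times = [float(s) for s in range(300)]
values = [65 + 5 * math.exp(-t / 40) * math.cos(t / 15) for t in times]
print(step_response_stats(times, values, setpoint=65.0))
```

At 5 minute sampling this whole transition would collapse into one or two data points; at 1 second you can see whether the ringing and overshoot are within spec.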

The controllers have code running in them, sensors go bad, and wiring connections are intermittent.  How do you find these problems?  Being able to go into a debug mode could be useful.

If Google were able to compare detailed operations of two different installations of the same mechanical system, then it could find whether a problem is unique to one site.  Or it may simply compare the same system at different points in time.
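A sketch of what that site-to-site comparison could look like.  The parameter names, readings, and threshold here are invented for illustration: take samples from the same commanded change at both sites and flag the parameters that diverge.

```python
def compare_sites(site_a, site_b, threshold=0.5):
    """site_a/site_b: dicts mapping a parameter name to a list of
    1-second samples taken after the same commanded change at each site.
    Returns the parameters whose mean readings diverge beyond threshold."""
    diverging = {}
    for param in site_a.keys() & site_b.keys():
        mean_a = sum(site_a[param]) / len(site_a[param])
        mean_b = sum(site_b[param]) / len(site_b[param])
        if abs(mean_a - mean_b) > threshold:
            diverging[param] = round(mean_a - mean_b, 2)
    return diverging

# Invented readings: condenser water pump VFD speed and LWT setpoint.
site_a = {"cwp_vfd_speed_pct": [72.1, 72.3, 72.2], "lwt_setpoint_f": [68.0, 68.0, 68.0]}
site_b = {"cwp_vfd_speed_pct": [78.9, 79.2, 79.0], "lwt_setpoint_f": [68.0, 68.1, 68.0]}
print(compare_sites(site_a, site_b))
```

A pump running 7% faster at one site for the same commanded state is exactly the kind of anomaly (bad sensor, intermittent wiring, drifting controller) a debug mode would surface.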

Apple's Acquisition of Beats may be beyond sound

Remember when people were wearing Bluetooth earpieces?  I have a couple that I haven’t used for years.  Why?  Sound quality, battery life, and connectivity.  I have enough problems keeping a cell connection, let alone worrying about battery life and Bluetooth.  So, I always use a hardwired headset.  It’s probably greener too, because you eliminate the use of Bluetooth.

Now that Apple has officially announced the acquisition of Beats, the media and analysts are asking why?

Katy Huberty, Morgan Stanley: Apple beats the service drum. "Subscription music service could make the deal a home run, with every 1% penetration of Apple's 800M account base equating to $960M of revenue. Apple believes Beats offers the right strategy for streaming music as it leverages both algorithms and 200 human curators to create playlists, which differentiates it from competitors."

Scott Craig, Merrill Lynch: Expensive acquisition. "The $3bn price tag sounds high, especially given Carlyle Group's investment in Beats Electronics in September 2013 valued the company at ~$1bn. This seems out of character with Apple's track record of acquisitions which are typically more tuck-in types with focus on IP/technology, rather than brand. Apple would also be acquiring high end, high margin headphones which according to NPD, account for 27% of the headphone market and 57% of premium headphone market ($99+)."

William Powers, Baird: Technology portfolio light. "We would note that the recently launched Beats Music streaming service is reported to have a mere ~250,000 subscribers and only has two awarded patents. Beats Electronics has 20 awarded patents, most of which are for ornamental design, vs. Harman's 868 patents and Bose's 558."

Given its pervasive use, the headset is a standard accessory.  My daughter demanded a Beats headset; I have Klipsch earbuds and use Apple’s standard ones.  It is time to change what a headset is.  You can embed sensors for heart rate.  It’s stuck in your ear, so it probably can read body temperature.  Put in motion sensors and you can figure out which way people are looking.  Put in a small camera and you can do point of view.

If Apple wanted to change the headset into an extension of the iPhone, what company should it buy?  An established company in sound?  Or someone who is creating new business models, like Beats?

I laugh when people only see what is in front of them.  Beats is a headphone and streaming music company?  No, Beats is a company that has successfully done things with headsets and has created a must-have accessory for actors, athletes, and other high-visibility people.  It’s not about sound; it’s about looking good.

Imagine if high-performing athletes used Apple/Beats headphones in their training routines; then sell 1,000x more to amateur athletes.  That could be why Nike decided to get out of the FuelBand business.

Could Beats create sunglasses with a headset and sensor technology?  Oops, this beats Google Glass, and it is cool.  Wearing shades inside while watching video is going to look so much better on Apple/Beats glasses than on Google Glass.

Google's Machine Learning Application is a Tool for AI, but not AI

Not AI; it is machine learning, a tool to support a mathematical model of a data center's mechanical systems

 

If you Google Image Search “artificial intelligence” you see these images.

[Image: Google Image Search results for “artificial intelligence”]

This is not what Google’s data center group has built with an application of machine learning.  When you Google Image Search “neural network” you see this.

 

[Image: Google Image Search results for “neural network”]

 

Google’s method to improve the efficiency of its data centers by optimizing for cost is a machine learning application, not, as covered in the media, an artificial intelligence system. With the term Artificial Intelligence, it is easy for many to assume the system thinks.  Google’s machine learning model takes 19 inputs, then creates a predicted PUE with 99.6% accuracy and the settings to achieve that PUE.

 

Problems to be solved

  1. The interactions between DC Mechanical systems and various feedback loops make it difficult to accurately predict DC efficiency using traditional engineering formulas.
  2. Using standard formulas for predictive modeling often produces large errors because they fail to capture such complex interdependencies.
  3. Testing each and every feature combination to maximize efficiency would be unfeasible given time constraints, frequent fluctuations in the IT load and weather conditions, as well as the need to maintain a stable DC environment.

These problems describe the difficulty of building a mathematical model of the system.

 

Why Neural Networks? 

To address these problems, a neural network is selected as the mathematical framework for training DC energy efficiency models. Neural networks are a class of machine learning algorithms that mimic cognitive behavior via interactions between artificial neurons [6]. They are advantageous for modeling intricate systems because neural networks do not require the user to predefine the feature interactions in the model, which assumes relationships within the data. Instead, the neural network searches for patterns and interactions between features to automatically generate a best fit model.

 

There are 19 different factors that are inputs to the neural network:

1. Total server IT load [kW]

2. Total Campus Core Network Room (CCNR) IT load [kW]

3. Total number of process water pumps (PWP) running

4. Mean PWP variable frequency drive (VFD) speed [%]

5. Total number of condenser water pumps (CWP) running

6. Mean CWP variable frequency drive (VFD) speed [%]

7. Total number of cooling towers running

8. Mean cooling tower leaving water temperature (LWT) setpoint [F]

9. Total number of chillers running

10. Total number of drycoolers running

11. Total number of chilled water injection pumps running

12. Mean chilled water injection pump setpoint temperature [F]

13. Mean heat exchanger approach temperature [F]

14. Outside air wet bulb (WB) temperature [F]

15. Outside air dry bulb (DB) temperature [F]

16. Outside air enthalpy [kJ/kg]

17. Outside air relative humidity (RH) [%]

18. Outdoor wind speed [mph]

19. Outdoor wind direction [deg]

 

There are five hidden layers with 50 nodes per layer.  The hidden layers are the blue circles in the diagram below.  The red circles are the 19 different inputs.  The yellow circle is the output, the predicted PUE.

[Image: neural network diagram]
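The paper does not publish the trained weights or the activation functions, but the structure above is enough to sketch a forward pass.  This toy version assumes sigmoid hidden units and a linear output, with random weights standing in for the trained ones; it only illustrates the 19-input, five-hidden-layer, single-output shape.

```python
import math
import random

random.seed(0)

# 19 plant inputs -> five hidden layers of 50 nodes -> 1 output (PUE).
LAYERS = [19, 50, 50, 50, 50, 50, 1]

def make_weights(layers):
    """Random weight matrices, one per layer transition; +1 slot per
    neuron holds its bias. Real weights would come from training."""
    return [[[random.gauss(0, 0.1) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(layers, layers[1:])]

def forward(x, weights):
    """Forward pass: sigmoid hidden units, linear output (predicted PUE)."""
    a = x
    for depth, layer in enumerate(weights):
        z = [sum(w * v for w, v in zip(neuron[:-1], a)) + neuron[-1]
             for neuron in layer]
        last = depth == len(weights) - 1
        a = z if last else [1 / (1 + math.exp(-v)) for v in z]
    return a[0]

weights = make_weights(LAYERS)
x = [random.random() for _ in range(19)]  # 19 normalized plant inputs
print(forward(x, weights))
```

With random weights the output is meaningless; training adjusts the weights until the output tracks the measured PUE.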

Multiple iterations are run to reduce cost.  The cost function is below.

[Image: cost function equation]
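As I read Jim Gao's paper, the cost function is the standard regularized quadratic cost: mean squared error between predicted and actual PUE, plus a weight-decay term to penalize large weights.  A small sketch, with an illustrative regularization value:

```python
def cost(predicted, actual, weights, lam=0.001):
    """Regularized quadratic (mean squared error) cost.

    predicted/actual: lists of PUE values over the training examples;
    weights: flat list of model weights; lam: regularization strength
    (the value here is illustrative, not from the paper).
    """
    m = len(actual)
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / (2 * m)
    reg = lam * sum(w ** 2 for w in weights) / (2 * m)
    return mse + reg

print(cost([1.11, 1.09, 1.12], [1.10, 1.10, 1.10], [0.2, -0.1, 0.05]))
```

Each training iteration nudges the weights in the direction that lowers this cost; the regularization term keeps the model from overfitting the 2 years of samples.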

 

Results:

A machine learning approach leverages the plethora of existing sensor data to develop a mathematical model that understands the relationships between operational parameters and the holistic energy efficiency. This type of simulation allows operators to virtualize the DC for the purpose of identifying optimal plant configurations while reducing the uncertainty surrounding plant changes.

 

[Image]

 

Note: I have used Jim Gao’s document with some small edits to create this post.

 

Google Implements Software Defined PUE, 99.6% Accurate DC Performance Modeling

Google has posted a paper on its Machine Learning Application for Data Center Optimization and a blog post.

PUE is a topic that Google knows well.  I was one of the first to interview Urs Hoelzle on Google’s PUE back in Oct 2008.  Over the years Google has shared more and more about its data centers, and its PUE has continued to get better, but there are limits to what people can do.

[Image: Google PUE trend]

After a while you reach a level of diminishing returns, and people get tired and frustrated with chasing that 0.01 improvement in PUE.

It seems obvious to use computers and run simulations of operations.  Many have tried to build models, but this approach has not been wildly successful, as the complexity of data center cooling systems is not easy to model.

Simulation is the imitation of the operation of a real-world process or system over time.[1] The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors/functions of the selected physical or abstract system or process. 

Another approach is to discover the model from operations data and let the data define the model; a machine learning model uses this approach.  Warning: this approach, which is used in handwriting recognition and OCR, requires training and testing to confirm the model is accurate.  Luckily, Google, with all its years of tracking PUE, has data to train a model and a Mechanical Engineer who had the vision to tackle this problem.
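That training-and-testing discipline can be sketched as a hold-out split: train on one slice of the history, then confirm accuracy on samples the model never saw.  Everything below is synthetic, and the "model" is just a mean-value baseline; the point is the discipline, not the model.

```python
import random

random.seed(42)

# Synthetic history: (features, measured PUE) pairs. Real data would be
# the sensor samples Google collects every 30 seconds.
samples = [("features_%d" % i, 1.10 + 0.01 * random.random()) for i in range(1000)]

random.shuffle(samples)
split = int(0.7 * len(samples))          # e.g. 70% to train, 30% held out
train, test = samples[:split], samples[split:]

# Baseline "model": predict the mean training PUE for every input.
baseline = sum(pue for _, pue in train) / len(train)

# Testing: measure error only on held-out data the model has never seen.
mae = sum(abs(baseline - pue) for _, pue in test) / len(test)
print(round(mae, 5))
```

If the held-out error is small and stable, you can trust the model on new operating conditions; if it is much worse than the training error, the model has memorized rather than learned.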

Jim Gao, an engineer on our data center team, is well-acquainted with the operational data we gather daily in the course of running our data centers. We calculate PUE, a measure of energy efficiency, every 30 seconds, and we’re constantly tracking things like total IT load (the amount of energy our servers and networking equipment are using at any time), outside air temperature (which affects how our cooling towers work) and the levels at which we set our mechanical and cooling equipment. Being a smart guy—our affectionate nickname for him is “Boy Genius”—Jim realized that we could be doing more with this data. He studied up on machine learning and started building models to predict—and improve—data center performance.  

After some trial and error, Jim’s models are now 99.6 percent accurate in predicting PUE. This means he can use the models to come up with new ways to squeeze more efficiency out of our operations. For example, a couple months ago we had to take some servers offline for a few days—which would normally make that data center less energy efficient. But we were able to use Jim’s models to change our cooling setup temporarily—reducing the impact of the change on our PUE for that time period.

Why do this?

  1. A machine learning approach leverages the plethora of existing sensor data to develop a mathematical model that understands the relationships between operational parameters and the holistic energy efficiency. 
  2. This type of simulation allows operators to virtualize the DC for the purpose of identifying optimal plant configurations while reducing the uncertainty surrounding plant changes.
  3. Model applications include DC simulation to evaluate new plant configurations, assessing energy efficiency performance, and identifying optimization opportunities.

Results:  Google predicts PUE with 99.6% accuracy.  Google has successfully modeled its mechanical systems using machine learning.  Google has implemented a Software Defined PUE, which allows it to predict PUE as systems and loads change.  Who doesn’t want this capability?  Almost everyone who has bought DCIM thought they would get this.

[Image]

There are many other ideas that Google has put together and I plan on writing more posts on what they have shared.  This is just the beginning of applying machine learning and neural networks to data center operations.  There are many other complex interactions that can be modeled.