Human Perception Behind Joyent's and Adobe's Outages: Typing Is Not the Cause

Joyent and Adobe both had outages in May.

Joyent posted a post-mortem.

Adobe posted an apology.

Both of these outages occurred during maintenance, when someone typed a command that did what it was supposed to do, impacting a service.  The human perception problem is that the person who typed the command could not see or perceive the system-wide impact of their command.

Adobe doesn’t provide any details on what it plans to do.  Joyent does:

We will be taking several steps to prevent this failure mode from happening again, and ensuring that other business disaster scenarios are able to recover more quickly.

First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously. We have already begun putting in place a number of immediate fixes to tools that operators use to mitigate this, and we will be rethinking what tools are necessary over the coming days and weeks so that "full power" tools are not the only means by which to accomplish routine tasks.

You can imagine the efforts people will go through to create safeguards to eliminate the possibility of this type of outage.  Unfortunately, this will also most likely put a burden on day-to-day operations.

Another way to solve the problem is to give people the ability to see the impact of their actions.  No one in their right mind would execute a command to reboot all the servers at Joyent.  And no one in their right mind would delete a directory with all user records.
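One way to give people that perception is to make the tooling itself show the blast radius before anything runs. Here is a minimal sketch in Python of what such a guard could look like; the function names and the 50-server threshold are my own illustrative assumptions, not Joyent's or Adobe's actual tooling.

```python
def confirm_blast_radius(targets, action):
    """Show the operator exactly what a command will touch before it runs."""
    print(f"About to run '{action}' on {len(targets)} servers:")
    for host in sorted(targets)[:10]:
        print(f"  - {host}")
    if len(targets) > 10:
        print(f"  ... and {len(targets) - 10} more")
    # For a large blast radius, make the operator type the count back,
    # which forces them to perceive the scale of what they asked for.
    if len(targets) > 50:
        answer = input(f"Type the number of servers ({len(targets)}) to proceed: ")
        return answer.strip() == str(len(targets))
    return input("Proceed? [y/N] ").strip().lower() == "y"


def reboot_servers(targets):
    """Reboot only after the blast radius has been seen and confirmed."""
    if not confirm_blast_radius(targets, "reboot"):
        print("Aborted: blast radius not confirmed.")
        return
    for host in targets:
        pass  # issue the reboot for this confirmed host (left out of the sketch)
```

The point is not the specific check; it is that the tool surfaces the system-wide impact before the command does anything.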

Watching Google's Data Center Machine Learning News spread

I was curious about how Google’s Data Center Machine Learning news would spread.

At 1a PT on May 28, 2014, Google posted on its main company blog, and the post has seen this kind of traffic over the past two days.

[Image: traffic chart for the Google blog post]

The following three posts went live at 1a PT on May 28, 2014, alongside the Google post, and each was able to interview Joe Kava, Google's VP of Data Centers.

http://gigaom.com/2014/05/28/google-is-harnessing-machine-learning-to-cut-data-center-energy/

Google’s head of data center operations, Joe Kava, says that the company is now rolling out the machine learning model for use on all of its data centers. Gao has spent about a year building it, testing it out and letting it learn and become more accurate. Kava says the model is using unsupervised learning, so Gao didn’t have to specify the interactions between the data — the model will learn those interactions over time.

http://www.datacenterknowledge.com/archives/2014/05/28/google-using-machine-learning-boost-data-center-efficiency/

http://www.wired.com/2014/05/google-data-center-ai/

The Wired article spun the machine learning as an Artificial Brain, which gave it more traffic than the others.

[Image: traffic chart for the Wired article]

But as I wrote, Google’s machine learning is not really AI in the way people would think.

BTW, in looking at the other articles, I realized my mistake.  In my post at 1a on May 28, I was a total nerd and got focused on the technology and didn’t mention Joe Kava’s name even though I had interviewed him.  Damn.

Throughout the day the rest of the tech media were able to add their own posts.  I don’t know about you, but I am pretty impressed that Google was able to execute a media strategy that got the range of tech media to post on its Going Beyond PUE with Machine Learning.  PUE is not something widely discussed beyond the data center crowd.

Note the ComputerWeekly blogger was at the event where Joe Kava keynoted and got 10 minutes of Joe’s time.

My 10 minutes with Google's datacentre VP

ComputerWeekly.com (blog) - May 28, 2014
Google's Joe Kava speaking at the Google EU Data Center Summit (Photo credit: Tom Raftery) ... Google's network division, which is the size of a medium enterprise, had a technology refresh and by spending between $25,000 and $50,000 per site, we could improve their high availability features and improve their PUEs from 2.2 to 1.5. The savings ... As more volumes of data are created and as mass adoption of the cloud takes place, naturally it will require IT to think about datacentres and its efficiency differently.

Google Blog: Better Data Centers Through Machine Learning

PCBDesign007 - May 28, 2014
It's no secret that we're obsessed with saving energy. For over a decade we've been designing and building data centers that use half the energy of a typical data center, and we're always looking for ways to reduce our energy use even further. In our pursuit ...

Google is improving its data centers with the power of machine learning

GeekWire - May 28, 2014
In its continuing quest to improve the efficiency of its data centers, Google has found a new solution: machine learning. Jim Gao, an engineer on the company's data center team, has been hard at work on building a model of how ...

Google crafts neural network to watch over its data centers

Register - May 28, 2014
The project began as one of Google's vaunted "20 per cent projects" by engineer Jim Gao, who decided to apply machine learning to the problem of predicting how the power usage effectiveness of Google's data centers would change in response to tweaking ...

Google's Machine Learning: It's About More Than Spotting Cats

Wall Street Journal (blog) - May 28, 2014
Google said in a blog post Wednesday that it is using so-called neural networks to reduce energy usage in its data centers. These computer brains are able to recognize patterns in the huge amounts of data they are fed and "learn" how things like air ...

Google data centers get smarter all on their own -- no humans required

VentureBeat - May 28, 2014
While most of us were thinking that research would turn out speech recognition consumer products, it actually turns out that Google has applied its neural networks to the challenge of making its vast data centers run as efficiently as possible, preventing the ...

Google AI improves datacentre energy efficiency

ComputerWeekly.com - May 28, 2014
"Realising that we could be doing more with the data coming out of datacentres, Jim studied machine learning and started building models to predict – and improve – datacentre performance." The team's machine learning model behaves like other machine ...

Google taps machine learning technology to zap data center electricity costs

Network World (blog) - May 28, 2014
Google is using machine learning technology to forecast - with an astounding 99.6% accuracy -- the energy usage in its data centers and automatically shift power to certain sites when needed. Using a machine learning system developed by its self ...

Google's machine-learning data centers make themselves more efficient

Ars Technica - May 28, 2014
Google's data centers are famous for their efficient use of power, and now they're (literally) getting even smarter about how they consume electricity. Google today explained how it uses neural networks, a form of machine learning, to drive energy usage in its ...

Google is harnessing machine learning to cut data center energy

Bayoubuzz - May 28, 2014
Leave it to Google to have an engineer so brainy he hacks out machine learning models in his 20 percent time. Google says that recently it's been using machine learning — developed by data center engineer Jim Gao (his Googler nickname is "Boy Wonder") ...

Google turns to machine learning to build a better datacentre

ZDNet - May 28, 2014
"The application of machine learning algorithms to existing monitoring data provides an opportunity to significantly improve DC operating efficiency," Google's Jim Gao, a mechanical engineer and data analyst, wrote in a paper online. "A typical large-scale ... These models can accurately predict datacentre PUE and be used to automatically flag problems if a centre deviates too far from the model's forecast, identify energy saving opportunities and test new configurations to improve the centre's efficiency. "This type of ...

Does Google's Data Center Machine Learning Model have a debug mode? It should

I threw two posts up (1st post and 2nd post) on Google’s use of Machine Learning in the Data Center and said I would write more.  Well, here is another one.

Does Google’s Data Center Machine Learning Model have a debug mode?  The current system uses data collected every 5 minutes over about 2 years:

184,435 time samples at 5 minute resolution (approximately 2 years of operational data)

One thing almost no one does is debug their mechanical systems the way you would debug software.

Debugging is a methodical process of finding and reducing the number of bugs, or defects, in a computer program or a piece of electronic hardware, thus making it behave as expected. Debugging tends to be harder when various subsystems are tightly coupled, as changes in one may cause bugs to emerge in another.

What would a debugging mode look like in DCMLM (my own acronym for Data Center Machine Learning Model)?  You are seeing performance that suggests a subsystem is not behaving as expected.  Change the sampling rate to 1 second.  Hopefully the controller will function correctly at the higher sample rate.  The controller may work fine, but the transport bus may not.  With 1 second fidelity, make changes to settings and collect data.  Repeat the changes.  Compare results.  Create other stress cases.

What will you see?  From the time you change a setting, how long does it take to reach the desired state?  At 5 minute sampling you cannot see the transition and the possible delays.  Was the transition smooth or a step function?  Was there an overshoot in value and then corrections?
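A rough sketch in Python of what that analysis could look like on 1-second data; overshoot and settling time are standard control-engineering measures, and the tolerance band and example values here are my own illustrative assumptions.

```python
import numpy as np

def transition_metrics(samples, setpoint, tolerance=0.02):
    """Characterize how a signal sampled once per second settles after a
    setpoint change: percent overshoot and settling time in seconds.

    samples   -- readings taken every second, starting at the setting change
    setpoint  -- the desired steady-state value (assumed positive here)
    tolerance -- fraction of the setpoint treated as "settled" (an assumption)
    """
    samples = np.asarray(samples, dtype=float)
    band = tolerance * abs(setpoint)

    # Overshoot: how far past the setpoint the signal swung, as a percent.
    overshoot_pct = 100.0 * max(0.0, samples.max() - setpoint) / abs(setpoint)

    # Settling time: first second after which the signal stays inside the band.
    inside = np.abs(samples - setpoint) <= band
    settling_time = None
    for i in range(len(samples)):
        if inside[i:].all():
            settling_time = i
            break

    return overshoot_pct, settling_time

# Example with made-up readings: a value that overshoots a 72F setpoint, then settles.
readings = [68, 70, 73.5, 74.2, 72.8, 72.3, 72.1, 72.0, 72.0, 71.9]
print(transition_metrics(readings, setpoint=72.0))
```

At the 5 minute sampling rate both numbers are invisible; the transition is usually over before the next sample arrives.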

The controllers have code running in them, sensors go bad, wiring connections are intermittent.  How do you find these problems?  Being able to go into Debug mode could be useful.

If Google were able to compare the detailed operation of two different installations of the same mechanical system, it could find whether a problem is unique to a site.  Or it could simply compare the same system at different points in time.

Apple's Acquisition of Beats may be beyond sound

Remember when people were wearing Bluetooth earpieces?  I have a couple that I haven’t used for years.  Why?  Sound quality, battery life, and connectivity.  I have enough problems keeping a cell connection, let alone worrying about battery life and Bluetooth.  So I always use a hardwired headset.  It’s probably greener too because you eliminate the use of Bluetooth.

Now that Apple has officially announced the acquisition of Beats, the media and analysts are asking why?

Katy Huberty, Morgan Stanley: Apple beats the service drum. "Subscription music service could make the deal a home run, with every 1% penetration of Apple's 800M account base equating to $960M of revenue. Apple believes Beats offers the right strategy for streaming music as it leverages both algorithms and 200 human curators to create playlists, which differentiates it from competitors."

Scott Craig, Merrill Lynch: Expensive acquisition. "The $3bn price tag sounds high, especially given Carlyle Group's investment in Beat Electronics in September 2013 valued the company at ~$1bn. This seems out of character with Apple's track record of acquisitions which are typically more tuck-in types with focus on IP/technology, rather than brand. Apple would also be acquiring high end, high margin headphones which according to NPD, account for 27% of the headphone market and 57% of premium headphone market ($99+)."

William Powers, Baird: Technology portfolio light. "We would note that the recently launched Beats Music streaming service is reported to have a mere ~250,000 subscribers and only has two awarded patents. Beats Electronics has 20 awarded patents, most of which are for ornamental design, vs. Harman's 868 patents and Bose's 558."

Given the pervasive use of headsets, they are a standard accessory.  My daughter demanded a Beats headset; I have Klipsch earbuds and use Apple’s standard ones.  It is time to change what a headset is.  You can embed sensors for heart rate.  It’s stuck in your ear, so it can probably read body temperature.  Put in motion sensors and you can figure out which way people are looking.  Put in a small camera and you can do point of view.

If Apple wanted to change the headset into an extension of the iPhone, what company should it buy?  An established company in sound?  Or someone who is creating new business models, like Beats?

I laugh when people only see what is in front of them.  Beats is a headphone and streaming music company.  No, Beats is a company that has successfully done things with headsets and has created a must-have accessory for actors, athletes, and other high-visibility people.  It’s not about sound; it’s about looking good.

Imagine if high-performing athletes used Apple/Beats headphones as part of their training routine; Apple could then sell 1,000x more to amateur athletes.  That could be why Nike decided to get out of the FuelBand business.

Could Beats create sunglasses with a headset and sensor technology?  Oops, this beats Google Glass and it is cool.  Wearing shades inside while watching video is going to look so much better on Apple/Beats glasses than on Google Glass.

Google's Machine Learning Application is a Tool for AI, but not AI

Not AI; it is machine learning, a tool to support a mathematical model of a data center's mechanical systems.

 

If you Google Image Search “artificial intelligence” you see these images.

[Image: Google Image Search results for "artificial intelligence"]

This is not what Google’s data center group has built with an application of machine learning.  When you Google Image Search “neural network” you see this.

 

[Image: Google Image Search results for "neural network"]

 

Google’s method to improve the efficiency of its data centers by optimizing for cost is a machine learning application, not, as covered in the media, an artificial intelligence system. With "artificial intelligence" it is easy for many to assume the system thinks.   Google’s machine learning takes 19 inputs, then produces a predicted PUE with 99.6% accuracy and the settings to achieve that PUE.

 

Problems to be solved

  1. The interactions between DC Mechanical systems and various feedback loops make it difficult to accurately predict DC efficiency using traditional engineering formulas.
  2. Using standard formulas for predictive modeling often produces large errors because they fail to capture such complex interdependencies.
  3. Testing each and every feature combination to maximize efficiency would be unfeasible given time constraints, frequent fluctuations in the IT load and weather conditions, as well as the need to maintain a stable DC environment.

These problems describe the difficulty of building a mathematical model of the system.

 

Why Neural Networks? 

To address these problems, a neural network is selected as the mathematical framework for training DC energy efficiency models. Neural networks are a class of machine learning algorithms that mimic cognitive behavior via interactions between artificial neurons [6]. They are advantageous for modeling intricate systems because neural networks do not require the user to predefine the feature interactions in the model, which assumes relationships within the data. Instead, the neural network searches for patterns and interactions between features to automatically generate a best fit model.

 

There are 19 different factors that are inputs to the neural network:

1. Total server IT load [kW]

2. Total Campus Core Network Room (CCNR) IT load [kW]

3. Total number of process water pumps (PWP) running

4. Mean PWP variable frequency drive (VFD) speed [%]

5. Total number of condenser water pumps (CWP) running

6. Mean CWP variable frequency drive (VFD) speed [%]

7. Total number of cooling towers running

8. Mean cooling tower leaving water temperature (LWT) setpoint [F]

9. Total number of chillers running

10. Total number of drycoolers running

11. Total number of chilled water injection pumps running

12. Mean chilled water injection pump setpoint temperature [F]

13. Mean heat exchanger approach temperature [F]

14. Outside air wet bulb (WB) temperature [F]

15. Outside air dry bulb (DB) temperature [F]

16. Outside air enthalpy [kJ/kg]

17. Outside air relative humidity (RH) [%]

18. Outdoor wind speed [mph]

19. Outdoor wind direction [deg]

 

There are five hidden layers with 50 nodes per layer.  The hidden layers are the blue circles in the diagram below.  The red circles are the 19 different inputs.  The yellow circle is the output, the predicted PUE.

[Image: neural network diagram with 19 inputs, five hidden layers, and the predicted PUE output]
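To make the shape of the network concrete, here is a minimal sketch of that forward pass in Python. The random weights stand in for trained parameters, and the sigmoid activation is an illustrative assumption; this is just the architecture described above, not Google's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape of the network described above: 19 inputs, five hidden layers of
# 50 nodes each, and one output, the predicted PUE.
layer_sizes = [19] + [50] * 5 + [1]

# Random weights stand in for trained parameters; this shows the structure only.
weights = [rng.standard_normal((n_in, n_out)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_pue(features):
    """Forward pass: the 19 operational inputs in, one predicted PUE out."""
    a = np.asarray(features, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(a @ W + b)                    # hidden layers (activation assumed)
    return (a @ weights[-1] + biases[-1]).item()  # output layer: predicted PUE

# Example: one sample of the 19 inputs listed above (placeholder values).
print(predict_pue(rng.random(19)))
```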

Multiple iterations are run to reduce cost.  The cost function is below.

[Image: cost function equation]
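The equation image is not reproduced here. For reference, the standard regularized mean squared error used to train this kind of regression network looks roughly like this (a sketch; the exact form and regularization term in the paper may differ):

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^{2}
          + \frac{\lambda}{2m}\sum_{j}\theta_j^{2}
```

Here m is the number of training samples, h_theta(x^(i)) is the predicted PUE for sample i, y^(i) is the measured PUE, theta are the network weights, and lambda controls the regularization. Each training iteration adjusts theta to reduce J.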

 

Results:

A machine learning approach leverages the plethora of existing sensor data to develop a mathematical model that understands the relationships between operational parameters and the holistic energy efficiency. This type of simulation allows operators to virtualize the DC for the purpose of identifying optimal plant configurations while reducing the uncertainty surrounding plant changes.

 

[Image: results chart]
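That last point, using the model to test configurations virtually, is the practical payoff. Here is a sketch in Python of what that could look like with any trained PUE model; the function names, feature index, and candidate setpoints are my own illustrative assumptions.

```python
import numpy as np

def best_setpoint(predict, current_features, feature_index, candidates):
    """Sweep one input through a trained PUE model and return the candidate
    value with the lowest predicted PUE.

    predict          -- function mapping the 19 features to a predicted PUE
                        (for example, the forward pass sketched earlier)
    current_features -- the 19 live readings
    feature_index    -- which input to vary (e.g. a setpoint)
    candidates       -- values to try for that input
    """
    best = None
    for value in candidates:
        trial = np.array(current_features, dtype=float)
        trial[feature_index] = value
        pue = predict(trial)
        if best is None or pue < best[0]:
            best = (pue, value)
    return best  # (lowest predicted PUE, setting that achieves it)

# Example (illustrative): vary input #8, the mean cooling tower LWT setpoint,
# from 65F to 75F while holding the other 18 readings at their current values.
# print(best_setpoint(predict_pue, current_readings, feature_index=7, candidates=range(65, 76)))
```

The model does the "what if" work; the operator only commits the change once the predicted PUE looks better, which is exactly the reduction in uncertainty the Results paragraph describes.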

 

Note: I have used Jim Gao’s document with some small edits to create this post.