Machine Learning (ML) in Google’s Data Center, Jeff Dean shares details

Jeff Dean is one of Google's amazing engineers, and his work shapes Google's data centers. He posted a presentation on ML that is here. Who is Jeff Dean? Here is a Business Insider article on Jeff. If you want a good laugh, check out the jokes about Jeff Dean's capabilities. I've been lucky to have a few conversations with Jeff and to watch him up close, which helps when reading the ML presentation.

Below is a small fraction of what is in Jeff's presentation. It is going to take me a while to digest, and luckily I shared the presentation with a friend who has been getting into ML architecture, so we are both looking at ML systems.

Part of Jeff’s presentation is the application of ML in the data center. 


This slide doesn't show up until three-quarters of the way through the presentation, and to show you how important it is, it appears again in Jeff's conclusion slide.



So now that you have seen the end slide, what is Jeff trying to do? It's kind of simple: he wants computational power beyond the limits of Intel processors. Urs Hoelzle wrote a paper on the need for brawny cores rather than the industry's drift toward wimpy cores.


So what’s this look like? 



Look at the aisle shot. 


And here is a shot of the TPU logic board with 4 TPUs.


Google has had a mindset since its early days that gives it an advantage over many

In 1998, Google had a $100k check from Andy Bechtolsheim. In 1998 that could buy between 5 and 15 Compaq servers of the kind used to serve web content. To build a high-availability system you would keep a hot spare, which could mean only half of your resources were available. Google instead took a path few took back then: using consumer components.


Above are the first Google servers. The first iteration of Google production servers was built with inexpensive hardware and was designed to be very fault-tolerant.

In 2013, Google published its "The Datacenter as a Computer" paper.

A key part of this paper is its discussion of hardware failure.


"The sheer scale of WSCs requires that Internet services software tolerate relatively high component fault rates. Disk drives, for example, can exhibit annualized failure rates higher than 4% [123, 137]. Different deployments have reported between 1.2 and 16 average server-level restarts per year. With such high component failure rates, an application running across thousands of machines may need to react to failure conditions on an hourly basis. We expand on this topic further in Chapter 2, which describes the application domain, and Chapter 7, which deals with fault statistics."
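To get a feel for why failures become an hourly event at this scale, here is a back-of-the-envelope sketch in Python. The fleet size and restart rate are assumptions of mine, chosen to sit inside the ranges the quote gives:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def events_per_hour(num_machines, events_per_machine_per_year):
    """Expected fleet-wide failure events per hour, assuming the
    per-machine annual rate spreads evenly across the year."""
    return num_machines * events_per_machine_per_year / HOURS_PER_YEAR

# Hypothetical fleet of 10,000 servers at 8 restarts/year each
# (mid-range of the 1.2-16 restarts/year quoted above)
print(events_per_hour(10_000, 8))  # ~9.1 restarts per hour fleet-wide

# Disks: 10,000 servers x 4 drives each at a 4% annualized failure rate
print(events_per_hour(10_000 * 4, 0.04))  # ~0.18 disk failures per hour
```

With illustrative numbers like these, an application spanning the fleet really does see roughly one failure event every few minutes to hours, which is the point the paper is making.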

Google has come a long way from using inexpensive hardware, but what has carried forward is how to deal with failures.

Some may think 2 nodes in a system are enough for high availability, but the smart ones know that you need 3 nodes and really want 5 nodes in the system.
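The 3-vs-5 reasoning comes from majority quorums: a replicated system stays available only while a majority of its nodes are up, so a 2-node cluster tolerates zero failures. A minimal sketch of that arithmetic:

```python
def tolerated_failures(n):
    """A majority-quorum system of n nodes keeps working only while
    more than half the nodes are alive, so it survives (n - 1) // 2
    simultaneous failures."""
    return (n - 1) // 2

for n in (2, 3, 5):
    print(f"{n} nodes tolerate {tolerated_failures(n)} failure(s)")
# 2 nodes tolerate 0, 3 tolerate 1, 5 tolerate 2
```

This is why 3 is the practical minimum and 5 is what you really want: 5 nodes let you lose a machine during a rolling upgrade of another and still have a quorum.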

Google Tours Its Data Centers as part of the Cloud Battle

Google's Joe Kava presented at its Cloud event. Go to the 5-minute mark in the video below.

Open a Chrome or Firefox browser and you can take this 360-degree data center tour.

Wouldn't it be great if Amazon and Microsoft responded in a similar way?

If you don't like watching videos there are a few news articles that report on the above.

Google finally told its most important cloud customers what they wanted to hear
Business Insider - Mar 24, 2016
In fact, Kava claims that Google is the "world's largest private investor in renewable energy," with $2 billion given to wind and solar companies, as it tries to reduce its power consumption as much as it can. That's a cost savings that gets passed on ...

Google Cloud Platform's 3 keys to the roadmap: Data center, security, containers
TechRepublic - 22 hours ago
Joe Kava, one of the heads of Google's data center efforts, was the speaker who explained the company's strategy in the data center. Early on, there was a big push in the concept of "your data centers are Google's data centers," likely to position the ...
Google Cloud Platform touts investments in security, data centers, and containers
ZDNet - Mar 24, 2016
DeMichillie then introduced data center head Joe Kava, who walked through Google's data center strategy. According to Kava, the core principles of Google's approach to data centers are availability, security, and performance. Kava explained the company ...

So is 48V DC a big deal for the Data Center? Google's OCP contribution

Google's Urs Hoelzle announcing the contribution of the 48V and rack design at the OCP Summit was the news of the Summit.

Google's contribution is posted here.

Why is 48V a big deal? It matters if you are pushing higher-performing chips, as Google's Urs Hoelzle has discussed in his paper on the need for brawny cores vs. wimpy cores, and there is also a need for GPUs, as mentioned in the post.

As the industry's working to solve these same problems and dealing with higher-power workloads, such as GPUs for machine learning, it makes sense to standardize this new design by working with OCP. We believe this will help everyone adopt this next generation power architecture, and realize the same power efficiency and cost benefits as Google.
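The physics behind the 48V push: for a fixed power draw, current scales as 1/V, and resistive loss in the distribution path scales as I²R, so quadrupling the voltage cuts distribution loss by 16x. A rough sketch, where the rack power and bus resistance are assumptions of mine, purely illustrative:

```python
def distribution_loss_watts(power_w, volts, resistance_ohms):
    """Resistive loss in the power distribution path: P_loss = I^2 * R,
    where the current drawn is I = P / V."""
    current = power_w / volts
    return current ** 2 * resistance_ohms

# Illustrative numbers (assumed): a 10 kW rack, 2 milliohm bus resistance
for v in (12, 48):
    print(f"{v}V distribution loss: {distribution_loss_watts(10_000, v, 0.002):.0f} W")
# At 48V the current is 4x lower than at 12V, so the I^2*R loss is 16x lower
```

The same logic explains why 48V also shrinks copper: the bus bars can carry a quarter of the current for the same delivered power.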

Why would Google contribute the 48V DC design? As one ex-Googler said at OCP, Google wants to reduce the cost of 48V DC converters, and to do that it needs more volume. And Google has a history of sharing its innovations. See the timeline below of Google's contributions.