One thing that drives Jeff Dean's Google innovations: Necessity

Google's Jeff Dean spoke at GigaOm Structure.  Many of you probably don't know who Jeff Dean is, so let's start with who Jeff is by referring to a Slate article.

The programs that Dean was instrumental in building—MapReduce, BigTable, Spanner—are not the ones most Google users associate with the company. But they’re the kind that made Google—and, consequently, much of the modern Web as we know it—possible. And the projects he’s working on now have the potential to revolutionize information technology once again.

Jeff Dean is among Google's most valued contributors.

But a great software developer can do in a week what might take months for a team of 10 lesser developers—the difference is exponential rather than marginal.

Dean is among those who think about performance.

And as a Ph.D. student in computer science, he worked on compilers, programs that translate source code into a language that a computer can readily execute. “I’ve always liked code that runs fast,” he explains matter-of-factly.

The GigaOm post on Jeff Dean is here.  I got a chance to chat with Dean a bit, and one of the points he shared in our conversation and repeated on stage is that necessity drove the systems he built.

I think one of the things that has caused us to build infrastructure is that we were often doing things out of necessity, so we would be running into problems where we needed some infrastructure that would solve that problem in a way that could scale to deal with larger amounts of data or larger request volumes and all of these kinds of things. There’s nothing like the necessity of needing to do something to cause you to come up with abstractions that help you break through. So MapReduce was born out of needing to scale our indexing system.

"Necessity is the mother of invention" is a well known term.  How many times are there features that people really don't think are important.  Optional, take it or leave.  They are not a necessity.  To develop a feature of necessity, something everyone will eventually use is a challenge and comes with looking at the big picture and spending a lot of time thinking before coding.  The Slate article closes with...

If Dean has a superhuman power, then, it’s not the ability to do things perfectly in an instant. It’s the power to prioritize and optimize and deal in orders of magnitude. Put another way, it’s the power to recognize an opportunity to do something pretty well in far less time than it would take to do it perfectly. In Silicon Valley, that’s much cooler than shooting cowboys with an Uzi.

You can watch Jeff Dean in this video.  For those of you who don't have the time or patience to watch the whole video, the one thing I got out of talking to Jeff and watching his talk is his focus on the necessity of the things Google needs to do in its infrastructure.  And, as others I know who have talked to Jeff can confirm, he is a nice guy who just happens to be Google employee #20.

Google publishes ideas discussing a Good Enough approach to achieve low latency

It can be really hard to get the media to publish complex concepts, which is why companies will submit their own articles.  Google's Luiz Barroso and Jeff Dean have an article on Google's data center challenge of providing low-latency performance at scale.


The Tail at Scale

Systems that respond to user actions quickly (within 100ms) feel more fluid and natural to users than those that take longer. Improvements in Internet connectivity and the rise of warehouse-scale computing systems have enabled Web services that provide fluid responsiveness while consulting multi-terabyte datasets spanning thousands of servers; for example, the Google search system updates query results interactively as the user types, predicting the most likely query based on the prefix typed so far, performing the search and showing the results within a few tens of milliseconds. Emerging augmented-reality devices (such as the Google Glass prototype) will need associated Web services with even greater responsiveness in order to guarantee seamless interactivity.

The article may be long for most readers, so here are two key points.

In large information-retrieval (IR) systems, speed is more than a performance metric; it is a key quality metric, as returning good results quickly is better than returning the best results slowly. Two techniques apply to such systems, as well as to other systems that inherently deal with imprecise results:

Good enough. In large IR systems, once a sufficient fraction of all the leaf servers has responded, the user may be best served by being given slightly incomplete ("good-enough") results in exchange for better end-to-end latency. The chance that a particular leaf server has the best result for the query is less than one in 1,000 queries, odds further reduced by replicating the most important documents in the corpus into multiple leaf servers. Since waiting for exceedingly slow servers might stretch service latency to unacceptable levels, Google's IR systems are tuned to occasionally respond with good-enough results when an acceptable fraction of the overall corpus has been searched, while being careful to ensure good-enough results remain rare. In general, good-enough schemes are also used to skip nonessential subsystems to improve responsiveness; for example, results from ads or spelling-correction systems are easily skipped for Web searches if they do not respond in time.
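To make the good-enough idea concrete, here is a minimal sketch (my own, not Google's code) of a root server fanning a query out to leaf servers and answering once a sufficient fraction has responded or the deadline passes.  The fraction, the deadline, and the query_leaf stand-in are illustrative assumptions.

```python
import concurrent.futures
import random
import time

# Illustrative knobs -- assumed values, not Google's actual tuning.
GOOD_ENOUGH_FRACTION = 0.95   # answer once 95% of leaf servers have responded
DEADLINE_SECONDS = 0.050      # end-to-end latency budget for the fan-out

def query_leaf(leaf_id, query):
    """Stand-in for an RPC to one leaf server; the sleep simulates a latency tail."""
    time.sleep(random.uniform(0.001, 0.080))
    return (leaf_id, f"hits for {query!r} from leaf {leaf_id}")

def fan_out(query, num_leaves=100):
    """Fan a query out to all leaves, but return whatever has arrived by the
    deadline as long as a good-enough fraction of leaves has answered."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=num_leaves)
    futures = [pool.submit(query_leaf, i, query) for i in range(num_leaves)]
    done, not_done = concurrent.futures.wait(futures, timeout=DEADLINE_SECONDS)
    pool.shutdown(wait=False)                  # do not block on the stragglers
    results = [f.result() for f in done]
    if len(results) >= GOOD_ENOUGH_FRACTION * num_leaves:
        return results                         # slightly incomplete, but fast
    # Below the threshold: a real system might wait a bit longer or retry.
    return results

if __name__ == "__main__":
    answers = fan_out("jeff dean")
    print(f"answered with {len(answers)} of 100 leaf responses")
```

The point of the sketch is the trade the quote describes: the deadline bounds end-to-end latency, and the fraction check keeps the truly incomplete answers rare.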

Google uses a technique that is like sticking your toe in the water to test the environment before jumping in.  They call it a canary request.

Canary requests. Another problem that can occur in systems with very high fan-out is that a particular request exercises an untested code path, causing crashes or extremely long delays on thousands of servers simultaneously. To prevent such correlated crash scenarios, some of Google's IR systems employ a technique called "canary requests"; rather than initially send a request to thousands of leaf servers, a root server sends it first to one or two leaf servers. The remaining servers are only queried if the root gets a successful response from the canary in a reasonable period of time. If the server crashes or hangs while the canary request is outstanding, the system flags the request as potentially dangerous and prevents further execution by not sending it to the remaining leaf servers. Canary requests provide a measure of robustness to back-ends in the face of difficult-to-predict programming errors, as well as malicious denial-of-service attacks.

The canary-request phase adds only a small amount of overall latency because the system must wait for only a single server to respond, producing much less variability than if it had to wait for all servers to respond for large fan-out requests; compare the first and last rows in Table 1. Despite the slight increase in latency caused by canary requests, such requests tend to be used for every request in all of Google's large fan-out search systems due to the additional safety they provide.
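Here is a rough sketch of the canary-request pattern as I read it from the paper: the root tries one or two leaves first and only fans out to the rest if a canary answers in time.  The number of canaries, the timeout, and the query_leaf callable are assumptions for illustration, not Google's implementation.

```python
import concurrent.futures

class PotentiallyDangerousRequest(Exception):
    """Raised when the canary leaves crash, error, or time out on a request."""

# Assumed knobs for the sketch.
NUM_CANARIES = 2
CANARY_TIMEOUT_SECONDS = 0.200

def send_with_canary(query_leaf, query, leaf_ids):
    """Send the request to one or two canary leaves first; only fan out to the
    remaining leaves if a canary answers successfully within the timeout."""
    canaries, rest = leaf_ids[:NUM_CANARIES], leaf_ids[NUM_CANARIES:]
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(leaf_ids)) as pool:
        canary_futures = [pool.submit(query_leaf, i, query) for i in canaries]
        try:
            # Wait for the first canary to come back healthy.
            first_done = next(concurrent.futures.as_completed(
                canary_futures, timeout=CANARY_TIMEOUT_SECONDS))
            first_done.result()
        except Exception as exc:
            # Crash, error, or timeout on the canary: don't touch the other leaves.
            raise PotentiallyDangerousRequest(query) from exc
        # Canary succeeded: now query the remaining leaf servers as usual.
        rest_futures = [pool.submit(query_leaf, i, query) for i in rest]
        return [f.result() for f in canary_futures + rest_futures]
```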

Google's Server Environment is not as homogeneous as you think, up to 5 microarchitectures

There is a common belief that Google, Facebook, Twitter, and the newer Web 2.0 companies have it easier because they have homogeneous environments compared with a typical enterprise.  Well, Google has a paper that discusses how its supposedly homogeneous warehouse-scale computers (WSCs) are actually heterogeneous, and that there is an opportunity for performance improvements of up to 15%.

In this table Google lists the number of microarchitectures in 10 different data centers.  Google now has 13 WSCs, so this could indicate how long ago this analysis was run (maybe 2-3 years ago).  Or it could have been run more recently with 3 data centers dropped from the table; the 13th only came online over the past year and would probably not have enough data.

[Table from the paper: number of microarchitectures in each of 10 Google WSCs]

The issue pointed out in the paper is that the job manager assumes the cores are homogeneous.

[Figure from the paper: the job manager's view, assuming homogeneous cores]

When in fact they are not.

[Figure from the paper: the actual heterogeneous mix of microarchitectures]

Here is the results summary.

Results Summary: This paper shows that there is a significant performance opportunity when taking advantage of emergent heterogeneity in modern WSCs. At the scale of modern cloud infrastructures such as those used by companies like Google, Apple, and Microsoft, gaining just 1% of performance improvement for a single application translates to millions of dollars saved. In this work, we show that large-scale web-service applications that are sensitive to emergent heterogeneity improve by more than 80% when employing Whare-Map over heterogeneity-oblivious mapping. When evaluating Whare-Map using our testbed composed of key Google applications running on three types of production machines commonly found co-existing in the same WSC, we improve the overall performance of an entire WSC by 18%. We also find a similar improvement of 15% in our benchmark testbed and in our analysis of production data from WSCs hosting live services.
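To make the difference between heterogeneity-oblivious and heterogeneity-aware mapping concrete, here is a minimal greedy-placement sketch in the spirit of Whare-Map.  The real system learns its scores from live profiling; everything below (job names, platform names, scores, capacities) is made up for illustration.

```python
# A minimal sketch of heterogeneity-aware placement in the spirit of Whare-Map:
# score each (job, platform) pair from profiling data and greedily give the
# pickiest jobs their best platform while slots remain. All names and numbers
# here are illustrative assumptions, not data from the paper.

# Relative performance of each job on each microarchitecture (higher is better).
PERF_SCORE = {
    ("websearch", "platformA"): 1.00, ("websearch", "platformB"): 0.78,
    ("gmail-be",  "platformA"): 1.00, ("gmail-be",  "platformB"): 0.92,
    ("ads-mixer", "platformA"): 1.00, ("ads-mixer", "platformB"): 0.97,
}

def heterogeneity_aware_map(jobs, capacity):
    """Greedy mapping: jobs with the largest spread between their best and worst
    platform are placed first, each on its best platform that still has free slots."""
    capacity = dict(capacity)                      # don't mutate the caller's copy
    platforms = list(capacity)

    def spread(job):
        scores = [PERF_SCORE[(job, p)] for p in platforms]
        return max(scores) - min(scores)

    assignment = {}
    for job in sorted(jobs, key=spread, reverse=True):
        open_platforms = [p for p in platforms if capacity[p] > 0]
        best = max(open_platforms, key=lambda p: PERF_SCORE[(job, p)])
        assignment[job] = best
        capacity[best] -= 1
    return assignment

if __name__ == "__main__":
    jobs = ["websearch", "gmail-be", "ads-mixer"]
    print(heterogeneity_aware_map(jobs, {"platformA": 2, "platformB": 4}))
```

A heterogeneity-oblivious scheduler would treat platformA and platformB as interchangeable; the sketch instead spends the scarce slots on the jobs that care most about where they land.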

Here are the three different microarchitectures used in the paper: Table 3 covers production, and Table 4 covers the test bed.

[Tables 3 and 4 from the paper: production and test-bed microarchitectures]

Here is the range in performance for the three different microarchitectures.

[Figure from the paper: performance range across the three microarchitectures]

The new job scheduler is deployed at Google, and here are the results.

[Figure 11 from the paper: improvement from Whare-Map across 10 active WSCs]

Figure 11 shows the calculated performance improvement when using Whare-Map over the currently deployed mapping in 10 of Google’s active WSCs. Even though some major applications are already mapped to their best platforms through manual assignment, we have measured significant potential improvement of up to 15% when intelligently placing the remaining jobs. This performance opportunity calculation based on this paper is now an integral part of Google’s WSC monitoring infrastructure. Each day the number of ‘wasted cycles’ due to inefficiently mapping jobs to the WSC is calculated and reported across each of Google’s WSCs world wide.
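As a back-of-the-envelope sketch of how a "wasted cycles" number like that could be computed, suppose you know each job's cycle consumption plus its relative performance on its current and best platforms; the accounting and the numbers below are my assumptions, not Google's formula.

```python
# Rough sketch of a daily 'wasted cycles' report: compare the cycles each job
# actually consumed on the platform it landed on against the cycles the same
# work would have needed on its best platform. Numbers are made up.

def wasted_cycles(jobs):
    """jobs: list of dicts with cycles_used, perf_on_current, perf_on_best,
    where perf_* are relative performance scores (higher is better)."""
    total = 0.0
    for job in jobs:
        # Cycles the same work would have taken on the job's best platform.
        needed_on_best = job["cycles_used"] * job["perf_on_current"] / job["perf_on_best"]
        total += job["cycles_used"] - needed_on_best
    return total

daily_report = wasted_cycles([
    {"cycles_used": 9.0e14, "perf_on_current": 0.78, "perf_on_best": 1.00},
    {"cycles_used": 5.0e14, "perf_on_current": 1.00, "perf_on_best": 1.00},
])
print(f"wasted cycles today: {daily_report:.3g}")
```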

There is more in the paper I need to digest, but I need to finish this post as it is long enough already.

Google shares its 10-20% server performance improvement technique, analyzing the microarchitecture of AMD and Intel servers

If you told someone in the data center industry you could get a 10-20% performance gain, people wouldn't believe you.  If you said you had a new processor, memory, storage, or network architecture, you would have a better chance of being believed.  Would you believe someone who told you that, at the microarchitecture level of servers, designing the software to access local memory rather than non-local memory on existing systems could yield a 10-20% performance gain?  Well, Google has shared this information and is deploying the solution in its data centers.

This indicates that a simple NUMA-aware scheduling can already yield sizable benefits in production for those platforms. Based on our findings, NUMA-aware thread mapping is implemented and in the deployment process in our production WSCs.
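For a feel of what NUMA-aware placement means at the single-machine level, here is a small Linux-only sketch that pins a process to the CPUs of one NUMA node so that, with Linux's default first-touch policy, the memory it allocates and touches tends to be local to that node.  This is an illustration of the local-vs-remote-memory idea, not Google's cluster-level thread-mapping implementation.

```python
import os

def cpus_of_numa_node(node):
    """Read the CPU list for a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()          # e.g. "0-7,16-23"
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_node(node):
    """Pin the current process (and the threads it spawns) to one NUMA node's CPUs.
    With the default first-touch policy, memory it allocates and touches will then
    tend to come from that node, avoiding slower remote-memory accesses."""
    os.sched_setaffinity(0, cpus_of_numa_node(node))

if __name__ == "__main__":
    pin_to_node(0)
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```

The paper's tension between NUMA locality and cache contention shows up even in this toy: packing threads onto one node improves memory locality but can crowd that node's shared caches.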

Here is the Google paper, published in 2013.  Warning: this is not an easy paper to read if you are not familiar with operating systems and hardware.  But I hope it gives an appreciation of another way to green a data center by making some changes in software.

Optimizing Google's Warehouse Scale Computers: The NUMA Experience

Abstract: Due to the complexity and the massive scale of modern warehouse scale computers (WSCs), it is challenging to quantify the performance impact of individual microarchitectural properties and the potential optimization benefits in the production environment. As a result of these challenges, there is currently a lack of understanding of the microarchitecture-workload interaction, leaving potentially significant performance on the table.

This paper argues for a two-phase performance analysis methodology for optimizing WSCs that combines both an in-production investigation and an experimental load-testing approach. To demonstrate the effectiveness of this two-phase methodology, and to illustrate the challenges, methodologies, and opportunities in optimizing modern WSCs, this paper investigates the impact of non-uniform memory access (NUMA) for several Google's key web-service workloads in large-scale production WSCs. Leveraging a newly-designed metric and continuous large-scale profiling in live datacenters, our production analysis demonstrates that NUMA has a significant impact (10-20%) on two important webservices: Gmail backend and search frontend. Our carefully designed load-test further reveals surprising tradeoffs between optimizing for NUMA performance and reducing cache contention.