Google shares its technique for a 10-20% server performance improvement, analyzing the microarchitecture of AMD and Intel servers

If you told someone in the data center industry you could get a 10-20% performance gain, they wouldn't believe you.  If you said you had a new processor, memory, storage, or network architecture, they would be more likely to think you were telling the truth.  But would you believe someone who told you that, on existing systems, designing software to access local memory rather than non-local memory at the microarchitecture level of servers could yield a 10-20% performance gain?  Well, Google has shared this information and is deploying the solution in its data centers.

This indicates that a simple NUMA-aware scheduling can already yield sizable benefits in production for those platforms. Based on our findings, NUMA-aware thread mapping is implemented and in the deployment process in our production
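
To make the idea of NUMA-aware thread mapping a little more concrete, here is a minimal Python sketch. It is not Google's implementation, which the excerpt does not describe. The two-node topology, the page counts, and the function names are all hypothetical, chosen only for illustration. The sketch picks the NUMA node that holds most of a thread's memory pages and returns the CPUs on that node, so the thread can run close to its data.

```python
# Hypothetical topology (an assumption for illustration):
# NUMA node id -> set of CPU ids attached to that node.
TOPOLOGY = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}


def preferred_node(pages_per_node):
    """Return the node holding the most pages; ties go to the lowest node id."""
    return min(pages_per_node, key=lambda node: (-pages_per_node[node], node))


def local_access_fraction(pages_per_node, node):
    """Fraction of the thread's pages that are local if it runs on `node`."""
    total = sum(pages_per_node.values())
    return pages_per_node.get(node, 0) / total if total else 0.0


def cpus_for_thread(pages_per_node, topology=TOPOLOGY):
    """CPUs on the node where most of the thread's memory lives."""
    return topology[preferred_node(pages_per_node)]


if __name__ == "__main__":
    # Hypothetical thread with 100 pages on node 0 and 900 pages on node 1.
    pages = {0: 100, 1: 900}
    node = preferred_node(pages)
    print(node, sorted(cpus_for_thread(pages)), local_access_fraction(pages, node))
```

On Linux, the returned CPU set could then be passed to `os.sched_setaffinity(0, cpus)` to pin the calling process, provided those CPU ids actually exist on the machine. A real scheduler would also have to weigh locality against other effects, such as the cache contention tradeoff the paper's abstract mentions.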


Here is the Google paper, published in 2013.  Warning: this is not an easy paper to read if you are not familiar with operating systems and hardware.  But I hope it gives an appreciation of another way to green a data center by making some changes in software.

Optimizing Google's Warehouse Scale Computers: The NUMA Experience

Abstract: Due to the complexity and the massive scale of modern warehouse scale computers (WSCs), it is challenging to quantify the performance impact of individual microarchitectural properties and the potential optimization benefits in the production environment. As a result of these challenges, there is currently a lack of understanding of the microarchitecture-workload interaction, leaving potentially significant performance on the table.

This paper argues for a two-phase performance analysis methodology for optimizing WSCs that combines both an in-production investigation and an experimental load-testing approach. To demonstrate the effectiveness of this two-phase methodology, and to illustrate the challenges, methodologies, and opportunities in optimizing modern WSCs, this paper investigates the impact of non-uniform memory access (NUMA) for several Google's key web-service workloads in large-scale production WSCs. Leveraging a newly-designed metric and continuous large-scale profiling in live datacenters, our production analysis demonstrates that NUMA has a significant impact (10-20%) on two important webservices: Gmail backend and search frontend. Our carefully designed load-test further reveals surprising tradeoffs between optimizing for NUMA performance and reducing cache contention.