The Art of the Green Data Center – Wind, Water, and Energy

Julius Neudorfer wrote an article on Feng Shui and the Art of the Data Center.

Feng Shui and the Art of the Data Center

Data Center | Blog Post | Julius Neudorfer, Thursday, January 14, 2010

Tags: Cooling Systems, Green Technology

I am in the midst of designing a new data center for a client and have been trying to balance the requirements of costs, space limitations, maximum number of cabinets and the flexibility to meet the rising and ever-changing power density of the IT equipment loads. Of course, high energy efficiency is a given. In addition, the client is especially concerned about esthetics and how it will look when a visitor enters the room.

My wife was reading a book on Feng Shui and suggested that I begin to incorporate it into my thinking. Not having enough time to become a Feng Shui master by the project’s deadline, I did some quick reading and found this definition:

Feng Shui is an ancient art of placement to bring balance and harmony to a physical space. The loose translation of Feng Shui is 'wind and water.' Feng represents the wind that carries the chi (energy) throughout a space. Shui is the water that meanders underneath the earth transporting chi.

Julius makes the connection to data centers here.

And while I don’t think that they had a data center in mind when Feng Shui was first introduced hundreds of years ago, I found a strong parallel to the data center’s infrastructure in the definition. The three elements that are mentioned, wind (airflow), water (chilled water) and energy (power), all apply to the operation of a data center and of course, we also want to bring “balance and harmony to a physical space”.

This brings up an interesting view of Wind vs. Water – air cooling vs. water cooling, a debate that reminds me of people arguing over political views.

What most miss is that, in the spirit of Feng Shui, your goal is to be in balance and harmony with the physical space. Some sites are better suited for wind, some for water, all while watching your energy use.

When I hear someone debate air vs. water for a specific site, the conversations are usually resolved quickly with a mutual understanding.

Why is this debate so heated? Many times it is fueled by equipment vendors who have proxies (people) advocating for their technology.


Architecture of Internet Datacenters

How many of you would like to attend a course on the Architecture of Internet Data Centers? This course is part of the RAD Lab, which wrote the Above the Clouds paper.


Well, in the fall of 2007, UC Berkeley (my alma mater) offered the following course for graduate students.

CS 294-14: Architecture of Internet Datacenters (RADLab Research Seminar 2.0)

Instructor: Randy H. Katz
Time: MW 2:30-4:00 PM
Place: 310 Soda
Units: 3 (2-4, but you had better sign up for 3!)

Course Description

Internet Datacenters have recently emerged as a significant new computing platform, designed to provide high capacity processing for large numbers of web clients. Major web properties like Google have designed their own building-scale computer facilities, integrating processing, storage, internal and external networking, along with integral power and cooling infrastructures. The resulting datacenters typically deploy 100,000 to 1,000,000 computers within a single facility.

In this research seminar, we will read and discuss the very recent literature on the design and implementation of processor clusters, virtual machines, virtual storage, and datacenter networking organization. Architectural approaches to deal with failures, effective sharing of processing/storage/network resources, and efficient management of power across the systems stack will be considered. Some class meetings will be dedicated to meeting with and discussing issues with industrial leaders from Google, IBM, Cisco, and Network Appliance.

Here are the first two weeks.

Week 1: Course Organization, Overview, and Technology Trends

  • Monday, August 27
    1. [Randy] Randy H. Katz, “Internet-scale Computing: The Berkeley RADLab Perspective,” IWQoS 2007, Evanston, IL, (June 2007). [pdf]
    2. [Randy] Stephen Alan Herrod, VMWare, “The Future of Virtualization Technology,” ISCA 2006. [pdf]
  • Wednesday, August 29
    1. [Randy] Raj Yavatkar, Intel, “Platforms Design Challenges with Many Cores,” HPCA-12, 2006. [pdf]
    2. [Randy] Renato Recio, IBM, “System IO Network Evolution: Closing the Requirement Gaps,” HPCA-12, 2006. [pdf]
    3. [Randy] Steve Kleiman, NetApp, “Trends in Managing Data at the Petabyte Scale,” FAST 2007, San Jose, CA, (February 2007). [pdf]
Week 2: Applications Software Infrastructure
  • Monday, September 3: Labor Day Holiday
  • Wednesday, September 5
    2:30-4:00
    1. [Matei] S. Ghemawat, H. Gobioff, S.-T. Leung, “The Google File System,” Proc. SOSP’03, 2003. [pdf] [Notes].
    2. [Kuang] J. Dean, S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. OSDI’04, pages 137–150, (December 2004). [pdf] [Notes].
    3. [Michael] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” Proc. OSDI'06, 2006. [pdf]

    6:00-7:30
    1. [Randy] Intel and Sun White Papers on Multicore Architectures [Notes]
      • Intel, "Intel Multi-Core Processors: Making the Move to Quad-Core and Beyond." [pdf]
      • Intel, "Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance." pdf
      • Intel, "Preparing for Peta-scale." [pdf]
      • Harlan McGhan, "Niagara 2 Opens the Flood Gates," Microprocessor Report, 11/6/2006. [pdf]
    2. [Ari] L. A. Barroso, J. Dean, U. Holzle, “Web Search for a Planet: The Google Cluster Architecture,” IEEE Micro, 23(2):22–28, March/April 2003. [pdf] [Notes]
    3. [Henry] L. A. Barroso, “The Price of Performance: An Economic Case for Chip Multiprocessing," ACM Queue, 3(7), September 2005. [html] [pdf].

What I found interesting were the student project proposals.

Student Project Proposal Presentations


Google’s Secret to Efficient Data Center Design – the Ability to Predict Performance

DataCenterKnowledge has a post on Google envisioning a future of 10 million servers.

Google Envisions 10 Million Servers

October 20th, 2009 : Rich Miller

Google never says how many servers are running in its data centers. But a recent presentation by a Google engineer shows that the company is preparing to manage as many as 10 million servers in the future.

Google’s Jeff Dean was one of the keynote speakers at an ACM workshop on large-scale computing systems, and discussed some of the technical details of the company’s mighty infrastructure, which is spread across dozens of data centers around the world.

In his presentation (link via James Hamilton), Dean also discussed a new storage and computation system called Spanner, which will seek to automate management of Google services across multiple data centers. That includes automated allocation of resources across “entire fleets of machines.”

Digging into Jeff Dean’s presentation, I found a Google secret.


Designs, Lessons and Advice from Building Large Distributed Systems

Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design – without actually having to build it!
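
That last skill is easy to practice with the latency numbers Dean lists in the same talk. Below is a quick back-of-the-envelope sketch of his thumbnail-page example; the code and constants are my own approximation of the 2009-era figures from the slides, not Google's.

```python
# Back-of-the-envelope estimate: how long to render a page of 30 image
# thumbnails, comparing a serial design against a parallel one.
# Constants are approximate 2009-era figures (assumptions, not benchmarks).

DISK_SEEK = 10e-3         # ~10 ms per disk seek
DISK_READ_BW = 30e6       # ~30 MB/s sequential read from disk

THUMBNAILS = 30           # images on one results page
IMAGE_SIZE = 256 * 1024   # ~256 KB per image

# Design 1: read all 30 images one after another from a single disk.
serial = THUMBNAILS * (DISK_SEEK + IMAGE_SIZE / DISK_READ_BW)

# Design 2: issue all 30 reads in parallel across 30 disks; the page is
# ready after one seek plus one sequential read.
parallel = DISK_SEEK + IMAGE_SIZE / DISK_READ_BW

print(f"serial design:   ~{serial * 1000:.0f} ms")   # ~560 ms
print(f"parallel design: ~{parallel * 1000:.0f} ms") # ~19 ms
```

A one-minute calculation like this tells you the parallel design is roughly 30x faster before a single line of production code is written, which is exactly the skill the slide is describing.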

What is Google’s assumption of where computing is going?


Thinking like an information factory: Google describes the machinery as servers, racks, and clusters, which supports the idea of information production. Google introduces the idea of the data center as a computer, but I find a more accurate analogy is to think of data centers as information factories. The IT equipment is the machinery in the factory, consuming large amounts of electricity for power and for cooling the IT load, housed in data centers like Google's facility in The Dalles, OR.

With all that equipment, things must break. And yes, they do.

Reliability & Availability
• Things will crash. Deal with it!
– Assume you could start with super reliable servers (MTBF of 30 years)
– Build computing system with 10 thousand of those
– Watch one fail per day
• Fault-tolerant software is inevitable
• Typical yearly flakiness metrics
– 1-5% of your disk drives will die
– Servers will crash at least twice (2-4% failure rate)
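
The arithmetic behind "watch one fail per day" is worth spelling out. A minimal sketch of the calculation:

```python
# Even "super reliable" servers fail daily at fleet scale.
MTBF_YEARS = 30     # assumed mean time between failures per server
SERVERS = 10_000    # fleet size from the slide

mtbf_days = MTBF_YEARS * 365            # ~10,950 days between failures
failures_per_day = SERVERS / mtbf_days  # expected fleet-wide failure rate

print(f"~{failures_per_day:.2f} failures per day")  # ~0.91, about one a day
```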

The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
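
To see why fault-tolerant software is "inevitable," you can roughly total the machine-downtime the list above implies. The sketch below uses the event counts from the slide, but the midpoint machine counts, the assumed durations, and the 10,000-machine cluster size are my assumptions, so treat the output as an order-of-magnitude figure only.

```python
# Rough machine-hours of downtime per year implied by the slide's
# "typical first year for a new cluster". Midpoints/durations assumed.
events = [
    # (events per year, machines affected, hours down)
    (1,    750, 6),    # PDU failure: ~500-1000 machines, ~6 hours
    (1,    750, 6),    # rack move: ~500-1000 machines, ~6 hours
    (20,    60, 3.5),  # rack failures: 40-80 machines, 1-6 hours
    (1000,   1, 6),    # individual machine failures (duration assumed)
]

machine_hours = sum(n * m * h for n, m, h in events)
CLUSTER = 10_000  # assumed cluster size

fraction_lost = machine_hours / (CLUSTER * 365 * 24)
print(f"~{machine_hours:,.0f} machine-hours lost per year")
print(f"~{fraction_lost:.3%} of machine-time lost to hardware alone")
```

Even under these optimistic assumptions, a couple of machines are down at any given moment, so the software has to route around failures rather than pretend they won't happen.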


Monitoring is how you know your estimates are correct.

Add Sufficient Monitoring/Status/Debugging Hooks
All our servers:
• Export HTML-based status pages for easy diagnosis
• Export a collection of key-value pairs via a standard interface
– monitoring systems periodically collect this from running servers
• RPC subsystem collects sample of all requests, all error requests, all requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc.
• Support low-overhead online profiling
– cpu profiling
– memory profiling
– lock contention profiling
If your system is slow or misbehaving, can you figure out why?
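
The "export a collection of key-value pairs via a standard interface" pattern is simple enough to sketch. Here is a minimal, hypothetical example (my own illustration, not Google's actual interface) of a server that serves an HTML status page for humans and a plain-text key-value endpoint that a monitoring system could poll:

```python
# Minimal sketch: /varz returns key-value metrics for scrapers,
# anything else returns an HTML status page for humans.
# The endpoint names are illustrative conventions, not Google's code.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()
METRICS = {"requests_total": 0, "errors_total": 0}

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        METRICS["requests_total"] += 1  # scrapes are counted too
        values = {**METRICS, "uptime_seconds": int(time.time() - START)}
        if self.path == "/varz":
            # Machine-readable: one "key value" pair per line.
            body = "".join(f"{k} {v}\n" for k, v in values.items())
            ctype = "text/plain"
        else:
            # Human-readable HTML status page for quick diagnosis.
            rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>"
                           for k, v in values.items())
            body = f"<html><body><h1>Status</h1><table>{rows}</table></body></html>"
            ctype = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StatusHandler).serve_forever()
```

A monitoring system then just fetches /varz every few seconds and tracks the deltas, in the spirit of the slide's "monitoring systems periodically collect this from running servers."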

Many people have quoted the idea “you can’t manage what you don’t measure.” But a more advanced concept that Google discusses is “If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!”

Know Your Basic Building Blocks
Core language libraries, basic data structures, protocol buffers, GFS, BigTable, indexing systems, MySQL, MapReduce, …
Not just their interfaces, but understand their implementations (at least at a high level)
If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!

These ideas are being discussed by a software architect, but they apply just as much to data center design. And the benefit Google has is that all of its IT and development staff think this way.


And here is another secret to great design: say No to features. But what the data center design industry wants to do is get you to say yes to everything, because it makes the data center building more expensive, increasing profits.


So what is the big design problem Google is working on? Spanner, the storage and computation system mentioned above that spans multiple data centers.

Jeff Dean did a great job of putting a lot of good ideas in his presentation, and it was nice Google let him present some secrets we could all learn from.


Data Center Myth – Thermal/Temperature Shock

Mike Manos has a post calling out what he terms “data center junk science”: the data center thermal shock requirement.

Mike’s post got my curiosity up, and I spent time researching to build on it. This is my 956th post in less than 2 years, and people often think I have a journalism background. Well, fooled you: I am an Industrial Engineering and Operations Research graduate from Cal Berkeley. So, even though I write a lot, you are reading my notebook of things I discover and want to share with others. For those of you who don’t know what industrial engineers do:

Industrial engineering is a branch of engineering concerned with the development, improvement, implementation and evaluation of integrated systems of people, money, knowledge, information, equipment, energy, material and process. It also deals with designing new prototypes to help save money and make the prototype better. Industrial engineering draws upon the principles and methods of engineering analysis and synthesis, as well as mathematical, physical and social sciences, together with the principles and methods of engineering analysis and design, to specify, predict and evaluate the results to be obtained from such systems. In lean manufacturing systems, industrial engineers work to eliminate wastes of time, money, materials, energy, and other resources.

This background all helps me think of how to green the data center.

And Operations Research helps me think about the technical methods and software to do this.

Operations research is an interdisciplinary branch of applied mathematics that uses methods such as mathematical modeling, statistics, and algorithms to arrive at optimal or near-optimal solutions to complex problems. It is typically concerned with determining the maxima (of profit, assembly line performance, crop yield, bandwidth, etc.) or minima (of loss, risk, etc.) of some objective function. Operations research helps management achieve its goals using scientific methods.

Mike’s post got me thinking because one of my summer internships was at HP, where I worked as a reliability/quality engineer figuring out how to build better-quality HP products. The team I worked on was an early innovator in thermal cycling and stress-testing components back in the early 1980s.

Data Center Junk Science: Thermal Shock \ Cooling Shock

October 1, 2009 by mmanos

I recently performed an interesting exercise where I reviewed typical co-location/hosting/ data center contracts from a variety of firms around the world.    If you ever have a few long plane rides to take and would like an incredible amount of boring legalese documents to review, I still wouldn’t recommend it.  :)

I did learn quite a bit from going through the exercise, but there was one condition that I came across more than a few times. It is one of those things that I put into my personal category of Data Center Junk Science. I have a bunch of these things filed away in my brain, but this one not only raises my stupidity meter from a technological perspective, it makes me wonder if those that require it have masochistic tendencies.

I am of course referring to a clause for Data Center Thermal Shock and, as I discovered, its evil, lesser-known counterpart “Cooling” Shock. For those of you who have not encountered this before, it’s a provision between hosting customer and hosting provider (most often required by the customer) that usually looks something like this:

If the ambient temperature in the data center rises 3 degrees over the course of 10 (sometimes 12, sometimes 15) minutes, the hosting provider will need to remunerate (reimburse) the customer for thermal shock damages experienced by the computer and electronics equipment. The damages range from flat-fee penalties to graduated penalties based on the value of the equipment.

As Mike asks, consider the issue of duration.

Which brings up the next component which is duration.   Whether you are speaking to 10 minutes or 15 minutes intervals these are nice long leisurely periods of time which could hardly cause a “Shock” to equipment.   Also keep in mind the previous point which is the environment has not even violated the ASHRAE temperature range.   In addition, I would encourage people to actually read the allowed and tested temperatures in which the manufacturers recommend for server operation.   A 3-5 degree swing  in temperature would rarely push a server into an operating temperature range that would violate the range the server has been rated to work in or worse — void the warranty.

Here is the military specification, MIL-STD-810G, typically used by vendors to define temperature/thermal shock.

MIL-STD-810G
METHOD 503.5
TEMPERATURE SHOCK

1. SCOPE.

1.1 Purpose.

Use the temperature shock test to determine if materiel can withstand sudden changes in the temperature of the surrounding atmosphere without experiencing physical damage or deterioration in performance. For the purpose of this document, "sudden changes" is defined as "an air temperature change greater than 10°C (18°F) within one minute."

1.2 Application.

1.2.1 Normal environment.

Use this method when the requirements documents specify the materiel is likely to be deployed where it may experience sudden changes of air temperature. This method is intended to evaluate the effects of sudden temperature changes of the outer surfaces of materiel, items mounted on the outer surfaces, or internal items situated near the external surfaces. These are, essentially, surface-level tests. Typically, this addresses:

a. The transfer of materiel between climate-controlled environment areas and extreme external ambient conditions or vice versa, e.g., between an air conditioned enclosure and desert high temperatures, or from a heated enclosure in the cold regions to outside cold temperatures.

b. Ascent from a high temperature ground environment to high altitude via a high performance vehicle (hot to cold only).

c. Air delivery/air drop at high altitude/low temperature from aircraft enclosures when only the external material (packaging or materiel surface) is to be tested.

As Mike says, the surprising part is that the requirement for thermal shock comes from technical people, most likely those with military backgrounds.

Even more surprising to me was that these were typically folks on the technical side of the house more than the lawyers or business people. I mean, these are the folks that should be more in tune with logic than, say, business or legal people who can get bogged down in the letter of the law or dogmatic adherence to how things have been done. Right? I guess not.

I can’t imagine any business person or attorney thinking a thermal shock is a 3-degree change over 15 minutes. If an attorney were involved, they would go to the MIL-STD-810G definition of temperature shock: greater than 10°C (18°F) within one minute.

So where does this myth come from? Most likely there is a social network effect of people who consider themselves smarter than others and have added thermal shock to the requirements. One of the comments on Mike’s blog documents the possible social network source.

Dave Kelley, Liebert Precision Cooling

The only place where something like this is “documented” in any way is in the ASHRAE Thermal Guidelines book. Since the group that wrote this book included all of the major server vendors, it must have been created with some type of justifiable reason. It states that the “maximum rate of temperature change is 5 degrees C (9 degrees F) per hour.”
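
Putting the three numbers quoted in this post on a common scale shows just how far apart they are. Here is a quick sketch of the conversion (the contracts Mike reviewed don't name a unit for the "3 degrees," so both readings are shown):

```python
# Convert the contract clause, the ASHRAE guideline, and the MIL-STD-810G
# shock definition to a common rate in degrees C per hour.

MIL_STD_SHOCK_C_PER_HR = 10 * 60  # >10 C per minute = 600 C/hr
ASHRAE_MAX_C_PER_HR = 5           # 5 C (9 F) per hour

def contract_rate(degrees, minutes, fahrenheit=False):
    """Rate implied by 'X degrees over Y minutes', in C per hour."""
    per_hour = degrees * 60 / minutes
    return per_hour * 5 / 9 if fahrenheit else per_hour

for unit, is_f in (("C", False), ("F", True)):
    rate = contract_rate(3, 15, fahrenheit=is_f)
    print(f"3 {unit} in 15 min = {rate:4.1f} C/hr, "
          f"{MIL_STD_SHOCK_C_PER_HR / rate:.0f}x slower than MIL-STD shock, "
          f"{rate / ASHRAE_MAX_C_PER_HR:.1f}x the ASHRAE guideline")
# 3 C in 15 min = 12.0 C/hr, 50x slower than MIL-STD shock, 2.4x ASHRAE
# 3 F in 15 min =  6.7 C/hr, 90x slower than MIL-STD shock, 1.3x ASHRAE
```

Either reading does exceed ASHRAE's recommended 5°C-per-hour rate of change, which is probably where the clause originated, but it sits one to two orders of magnitude below anything MIL-STD-810G would call "shock."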

And as Mike closes, this has unintended consequences.

But this brings up another important point. Many facilities might experience a chiller failure, or a CRAH failure, or some other event which might temporarily have this effect within the facility. Let’s say it happens twice in one year that you would potentially trigger this event for the whole or a portion of your facility (you’re probably not doing preventative maintenance – bad you!). So the contract language around thermal shock now claims monetary damages. Based on what? How are these sums defined? The contracts I read through had some wild oscillations on damages with different means of calculation, and a whole lot more. So what is the basis of this damage assessment? Again, there are no studies that say each event takes off .005 minutes of a server’s overall life, or anything like that. So the cost calculations are completely arbitrary and negotiated between provider and customer.

This is where the true foolishness then comes in. The providers know that these events, while rare, might happen occasionally. While the event may be within all other service level agreements, they still might have to award damages. So what might they do in response? They increase the costs, of course, to potentially cover their risk. It might be in the form of cost per kW, or cost per square foot, and it might even be pretty small or minimal compared to your overall costs. But in the end, the customer ends up paying more for something that might not happen, and if it does there is no concrete proof it has any real impact on the life of the server or equipment, and really only salves the whim of someone who failed to do their homework. If it never happens, the hosting provider is happy to take the additional money.

Temperature/thermal shock is a term that doesn’t apply to data centers. Hopefully you’ll now know to call temperature/thermal shock requirements in data center operations what they are: a myth.

Thanks Mike for taking the time to write on this.


Blogs vs. Twitter – Big vs. Small

Microsoft Research has an article discussing efforts by its researchers to understand blogs vs. Twitter. At first I wasn’t going to blog this article, but it brings up interesting points about big vs. small, which is a good question to ask about data centers. Is it best for data centers to be big or small?

Researchers Ride the Twitter Wave

By Rob Knies

August 6, 2009 2:00 PM PT

He rocks in the treetops all the day long,

Hoppin’ and a-boppin’ and a-singin’ his song.

All the little birds on Jaybird Street

Love to hear the robin go tweet tweet tweet …

* * *

When L.A. R&B singer Bobby Day took Jimmie Thomas’ lyrics to the top of the charts in the summer of 1958—a tune memorably revived in 1972 by a 13-year-old Michael Jackson—there was no way to foresee how those words would resonate a half-century later.

But they certainly do. Twitter, the wildly popular micro-blogging service, has become an Internet sensation, with millions flocking to the site each month to post a jittery stream of brief status updates. Whether it’s Ashton Kutcher or your cousin Sue, these days, it seems, everybody wants to emulate Rockin’ Robin.

Few people will argue for small data centers, as the whole supply chain is set up to maximize profits by building bigger, more complex data centers. What is the right size for a data center? The problem is that data center construction teams think from their construction and provisioning view. The social network effect of what happens with something like Twitter is beyond the data center construction team.

“Blogging has long been studied as a medium of information diffusion, and micro-blogging has started to be used for marketing. Analyzing the differences and similarities in terms of information-diffusion structure and efficiency can yield valuable knowledge to the proper use of each.”

Aren’t data centers built for information diffusion?

What types of data centers are ideal for information diffusion? I bet Facebook and Twitter can look at their data centers as social networks instead of buildings.
