Consequences of an Inefficient Information Factory aka Data Center

I posted on the concept of data centers being information factories.  Philip Petersen of www.adinfa.com wrote:

But when you mention "companies like Google" - are there really many companies like Google? I don't think so - not today.
Best,
Philip

I’ve actually had a few Skype conversations with Philip, and I met him in person last year at Data Center Dynamics London, so I know he is a regular reader.

I agree there are not many companies like Google.  Here are a few things I think Google does that fit the model of an information factory.

  1. Urs Hoelzle, the executive responsible for running Google’s data centers, understands the role of Google’s information factories.  I once asked Urs why he doesn’t shut down idle servers; his response was that he would rather think about how to use the servers while they are idle.  And Urs can think this way given his position and influence within Google.  What Google knows that few others do is that powering servers on and off is not reliable enough for a lights-out type of operation.  Desktops, laptops, mobile devices, and phones all have this problem as well, but people are pushing the buttons and can try again when a power-on fails.
  2. Their focus on PUE accuracy and reporting demonstrates their thinking in process control and statistical accuracy.
  3. Google knows the shell of a building is cheap, and 85% of the costs in a data center are in the power and cooling infrastructure.
  4. The cost of electricity is greater than the cost of the server over a typical three-year lifespan (a rough back-of-the-envelope check follows this list).
  5. The vendors – data center, server, network – are just as siloed as companies’ IT organizations and don’t drive for overall system efficiency.  So Google designs its own systems and uses the vendors as subcontractors to its designs.  The analogy may not be totally accurate, but Boeing designs the plane and subcontracts out pieces and components.  Some pieces are off the shelf, like engines (processors, in the server case), but many times parts don’t perform as advertised when integrated.
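A rough back-of-the-envelope check on point 4, using assumed numbers (the server price, power draw, facility overhead, and electricity rate are illustrative guesses, not Google’s figures):

```python
# Does three years of electricity really rival the purchase price of a server?
# All numbers below are assumptions for illustration, not Google's actual costs.
server_price = 2000.0   # USD, assumed commodity server purchase price
avg_power_w = 400.0     # W, assumed average draw including fans
pue = 2.0               # assumed overhead for a typical (non-Google) data center
rate_per_kwh = 0.10     # USD per kWh, assumed industrial electricity rate
years = 3

facility_kwh = (avg_power_w / 1000.0) * 24 * 365 * years * pue
energy_cost = facility_kwh * rate_per_kwh
print(f"3-year electricity cost: ${energy_cost:,.0f} vs. server price: ${server_price:,.0f}")
# With these assumptions the electricity bill (~$2,100) edges out the $2,000 server.
```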

I could go on, but these are just a few ideas that demonstrate Google runs its data centers as computers; see this paper.  The information factory metaphor communicates the scale, power, and complexity.

image 

 

As Philip says, there are not many like Google, which means most companies have inefficient information factories that are a drain on their revenue.  And in this economy cost reduction is a priority.  Do you cut costs by making the system more efficient?  No.  You cut costs by limiting headcount, budgets, and capital expenditures, which ironically will often increase your costs long term as you grow and decrease your overall performance per watt.  Right now many companies don’t need the performance, so removing capacity from the system to cut costs makes sense.  But I bet the company executives did not consciously decide to reduce the capacity of their information factory.  How can you not see that cutting capital and operating expenses reduces capacity?

In this economy Google may reduce the rate of their expansion, but overall their information factory capacity is growing and performance per watt is improving.

Philip asked a good question: “Are there really many companies like Google?”

No, but there will be more.  Google can do this because the company itself is an information factory.  And the future successes in internet services will be the companies with the most efficient information factories, producing information at the lowest cost.

Read more

Data Center Site Selection – NC rides Apple and Google Wave

Hickoryrecord.com has an article highlighting five North Carolina counties promoting a data center corridor.  Below is a picture of Apple’s under-construction data center.

5 counties promote data center corridor

Robert C. Reed

image

Construction is under way on the $1 billion Apple data center in Maiden.

By John  Dayberry | Hickory Daily Record

Published: October 28, 2009

Maiden - Scott Millar said establishing an information technology corridor stretching northwest from Charlotte could transform the region's economy.

"Partnering with Caldwell, Burke, Alexander and Iredell counties to market this to the world may give all the counties new business opportunities," said Millar, president of the Catawba County Economic Development Corp.

But what I found mind-blowing is that there were nearly 40 site selection consultants at the event.

On Tuesday and Wednesday, Millar and other economic development officials from the five counties outlined plans for a North Carolina data center corridor during a marketing event that attracted nearly 40 U.S. site selection consultants specializing in data center locations.

The regional economic development group is riding the Google and Apple wave.

California-based Google opened a $600 million data center in Caldwell County in 2008.

When Apple announced plans for its $1 billion Maiden data center in July, economic development officials saw magnified potential for a data center corridor in the region.

Apple's arrival in the region also heightened interest on the part of site selection consultants from New York, Chicago, Atlanta, Washington, D.C., and other cities, Millar said.

Attendance at the Data Center Information Exchange blossomed.

"Eight (consultants) came the first year, 18 came last year and 38 came this year," Millar said.

"We're getting attention."

But do you think 40 site selection consultants know how to pick data center sites?  I would guess maybe 4 out of the 40 really know what they are doing.  But how do you find the people who know what they are doing?  Do you think these guys know where Apple and Google are going to buy next?  And why buy where Apple and Google have already bought?

There is even a site selection magazine.

Site Selection Magazine, a nationally recognized publication, recently acknowledged a region anchored by financial data centers in Charlotte, Apple in Catawba County, Google in Caldwell County and the state's data center in Rutherford County as an emerging data center cluster that is attracting attention within the industry.

The site is here.

And, while all these site selection consultants were in NC, I was in Missouri, arriving in Kansas City, stopping in Columbia, and about to head to St. Louis.  One of the appealing parts of Missouri is the learning infrastructure.

Read more

Google positions itself #1 in Green Data Centers, hosts Secretary of Energy

CNET News has a post on U.S. Secretary of Energy Steven Chu appearing with Google CEO Eric Schmidt.

Google's warm reception for secretary of energy

by Tom Krazit

Google CEO Eric Schmidt (left) and U.S. Secretary of Energy Steven Chu at Google headquarters Monday.

(Credit: James Martin/CNET)

MOUNTAIN VIEW, Calif.--For a bunch of search engineers, Google employees care an awful lot about energy and the environment.

Google hosted an event for employees Monday featuring Steven Chu, the U.S. secretary of energy under President Obama and a man Chief Executive Eric Schmidt said "may become one of the most influential scientists of our generation, if he isn't already." Chu took about an hour to speak to a packed room of Google employees following his announcement of $151 million in funding for new energy-related projects as part of the ARPA-E program.

Part of the format has Schmidt interviewing Chu.

Schmidt, who serves as an adviser to the administration on President Obama's Council of Advisers on Science and Technology, asked Chu what it's like being the senior scientist in the government. He's actually the first scientist to hold the secretary of energy position, and won the Nobel Prize in Physics in 1997.

"It's funny in a macabre sort of way. I don't think Congress treats me like your average cabinet member," Chu said with a wry chuckle. He said he's spent much of his first year on the job talking to Congress about the problems with energy use and the environment, and that legislators are receptive, for the most part.

"I think the president has made it very clear that science plays such an integral role in the decisions we have to make," Chu said. He was preaching to the choir at the Googleplex.

On a regular basis I hear Green IT is a fad and not important.  Google has done a great job of providing a way for its staff to work together to use less energy for Google services. 

What those who think Green IT is a fad miss is that having your staff focus on making things greener means you have benchmarked your performance and are continually evaluating new ways to reduce energy consumption and your carbon footprint.  That saves money over the long haul and makes it easier to provide new services.

The wins in internet services are going to go to those who have the highest performance per watt.  Google is in a race, and many think the race isn’t worth the effort.  Amazon gets it.  Who else?

I bet you Eric Schmidt is helping the federal gov’t understand how much more efficient it would be to host services in the Google cloud vs federal data centers.

Can Google be the lowest-cost utility for data center services?  Who is competing with Google to be the lowest cost?  The lowest-cost provider will be the one that uses energy most efficiently.

Being the greenest is another way to say you are the lowest cost provider of IT services.

Still think Green IT will be a fad?

Read more

Google’s Secret to efficient Data Center design – ability to predict performance

DataCenterKnowledge has a post on Google’s (NASDAQ: GOOG) vision of a future with as many as 10 million servers.

Google Envisions 10 Million Servers

October 20th, 2009 : Rich Miller

Google never says how many servers are running in its data centers. But a recent presentation by a Google engineer shows that the company is preparing to manage as many as 10 million servers in the future.

Google’s Jeff Dean was one of the keynote speakers at an ACM workshop on large-scale computing systems, and discussed some of the technical details of the company’s mighty infrastructure, which is spread across dozens of data centers around the world.

In his presentation (link via James Hamilton), Dean also discussed a new storage and computation system called Spanner, which will seek to automate management of Google services across multiple data centers. That includes automated allocation of resources across “entire fleets of machines.”

Going to Jeff Dean’s presentation, I found a Google secret.

image

Designs, Lessons and Advice from Building Large Distributed Systems

Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design
– without actually having to build it!
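Jeff Dean’s deck makes this concrete with back-of-the-envelope latency math.  Here is a sketch in that spirit – the scenario (a results page of 30 disk-resident thumbnails) and the latency figures are approximate, commonly cited rules of thumb, not measurements from Google:

```python
# Estimate performance of a design without building it: how long to render a
# page of 30 image thumbnails (~256 KB each) stored on disk?
DISK_SEEK_S = 0.010        # ~10 ms per seek (approximate rule of thumb)
DISK_READ_BPS = 30e6       # ~30 MB/s sequential read (approximate rule of thumb)
THUMBS = 30
THUMB_BYTES = 256 * 1024

# Design 1: read the 30 thumbnails serially from one disk.
serial_s = THUMBS * (DISK_SEEK_S + THUMB_BYTES / DISK_READ_BPS)

# Design 2: issue the 30 reads in parallel across disks; the page is ready
# after roughly one seek plus one read (ignoring variance and fan-in cost).
parallel_s = DISK_SEEK_S + THUMB_BYTES / DISK_READ_BPS

print(f"serial design:   ~{serial_s * 1000:.0f} ms")    # ~560 ms
print(f"parallel design: ~{parallel_s * 1000:.0f} ms")  # ~19 ms
```

The estimate alone is enough to rule out the serial design for an interactive page, and no prototype had to be built.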

What is Google’s assumption about where computing is going?

image

Thinking like an information factory, Google describes the machinery as servers, racks, and clusters.  This approach supports the idea of information production.  Google introduces the idea of the data center as a computer, but I find a more accurate analogy is to think of data centers as information factories: the IT equipment is the machinery in the factory, consuming large amounts of electricity to power and cool the IT load.

 image

Located in a data center like the one in The Dalles, OR

image

With all that equipment things must break.  And, yes they do.

Reliability & Availability
• Things will crash. Deal with it!
– Assume you could start with super reliable servers (MTBF of 30 years)
– Build computing system with 10 thousand of those
– Watch one fail per day
• Fault-tolerant software is inevitable
• Typical yearly flakiness metrics
– 1-5% of your disk drives will die
– Servers will crash at least twice (2-4% failure rate)
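The “watch one fail per day” line is just arithmetic; a quick sketch of the slide’s own numbers:

```python
# The slide's arithmetic: 10,000 servers, each with a 30-year MTBF.
servers = 10_000
mtbf_years = 30
failures_per_day = servers / (mtbf_years * 365)
print(f"expected failures per day: {failures_per_day:.2f}")  # ~0.91, roughly one a day
```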

The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.

image

Monitoring is how you know your estimates are correct.

Add Sufficient Monitoring/Status/Debugging Hooks
All our servers:
• Export HTML-based status pages for easy diagnosis
• Export a collection of key-value pairs via a standard interface
– monitoring systems periodically collect this from running servers
• RPC subsystem collects sample of all requests, all error requests, all
requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc.
• Support low-overhead online profiling
– cpu profiling
– memory profiling
– lock contention profiling
If your system is slow or misbehaving, can you figure out why?
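Here is a minimal sketch of the kind of status export the slide describes – an HTML status page for humans plus key-value pairs for a monitoring poller.  The endpoint paths and metric names are my own illustrative choices, not Google’s actual interfaces:

```python
# Minimal sketch: HTML status page plus key-value metric export.
# Paths (/statusz, /varz) and metric names are illustrative assumptions.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()
METRICS = {"requests_total": 0, "errors_total": 0}

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        METRICS["requests_total"] += 1
        if self.path == "/statusz":   # human-readable status page for easy diagnosis
            body = f"<html><body><h1>server status</h1>uptime: {time.time() - START:.0f}s</body></html>"
            ctype = "text/html"
        elif self.path == "/varz":    # key-value pairs a monitoring system can poll
            body = json.dumps(METRICS)
            ctype = "application/json"
        else:
            METRICS["errors_total"] += 1
            self.send_error(404)
            return
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StatusHandler).serve_forever()
```

A monitoring system can then scrape the key-value endpoint periodically from every running server.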

Many people have quoted the idea “you can’t manage what you don’t measure.”  But a more advanced concept that Google discusses is “If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!”

Know Your Basic Building Blocks
Core language libraries, basic data structures,
protocol buffers, GFS, BigTable,
indexing systems, MySQL, MapReduce, …
Not just their interfaces, but understand their
implementations (at least at a high level)
If you don’t know what’s going on, you can’t do
decent back-of-the-envelope calculations!

These ideas are being discussed by a software architect, but they apply just as much to data center design.  And the benefit Google has is that all of its IT and development staff think this way.

image

And here is another secret to great design: say no to features.  But the data center design industry wants to get you to say yes to everything, because saying yes makes the data center building more expensive, increasing their profits.

image

So what is the big design problem Google is working on?

image

Jeff Dean did a great job of putting a lot of good ideas in his presentation, and it was nice Google let him present some secrets we could all learn from.

Read more

Google Releases Q3 2009 PUE Numbers

Google just updated their PUE measurement page with Q3 2009 numbers.

Quarterly energy-weighted average PUE:
1.22

Trailing twelve-month energy-weighted avg. PUE: 
1.19

Individual facility minimum quarterly PUE:
1.15, Data Center B

Individual facility minimum TTM PUE*:
1.14, Data Center B

Individual facility maximum quarterly PUE:
1.33, Data Center H

Individual facility maximum TTM PUE*:
1.21, Data Center A

* Only facilities with at least twelve months of operation are eligible for Individual Facility TTM PUE reporting
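Note the “energy-weighted average” wording: the fleet number is not a simple mean of facility PUEs; each facility counts in proportion to the energy it uses.  A sketch with made-up facility data (the weighting approach is my reading of the term, not Google’s published method):

```python
# Energy-weighted fleet PUE with made-up facility numbers.
# Assumption: fleet PUE = total facility energy / total IT energy,
# which weights each site by the energy it actually consumed.
facilities = [
    # (IT energy for the quarter in MWh, facility PUE)
    (10_000, 1.15),
    (4_000, 1.33),
    (7_000, 1.20),
]
total_it = sum(it for it, _ in facilities)
total_facility = sum(it * pue for it, pue in facilities)
print(f"energy-weighted fleet PUE: {total_facility / total_it:.2f}")  # 1.20
# A simple unweighted mean of the three PUEs would give ~1.23 instead.
```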

What is nice is that the Google guys have included their latest facility, Data Center J, even though it has only one quarterly data point.  Data Centers G, H, and I are also noted as not yet being tuned.

image
Notes:

We added one new facility, Data Center J, to our PUE report. Overall, our fleet QoQ results were as expected. The Q3 total quarterly energy-weighted average PUE of 1.22 was higher than the Q2 result of 1.20 due to expected seasonal effects. The trailing twelve-month energy-weighted average PUE remained constant at 1.19. YoY performance improved from facility tuning and continued application of best practices. The quarterly energy-weighted average PUE improved from 1.23 in Q3'08, and the TTM PUE improved from 1.21. New data centers G, H, I, and J reported elevated PUE results as we continue to tune operations to meet steady-state design targets.

The Google guys know they are going to get critiqued on how good their numbers are, so they described their measurement methods and error analysis.

Measurement Methodology

The PUE of a data center is not a static value. Varying server and storage utilization, the fraction of design IT power actually in use, environmental conditions, and other variables strongly influence PUE. Thus, we use multiple on-line power meters in our data centers to characterize power consumption and PUE over time. These meters permit detailed power and energy metering of the cooling infrastructure and IT equipment separately, allowing for a very accurate PUE determination.  Our facilities contain dozens or even hundreds of power meters to ensure that all of the power-consuming elements are accounted for in our PUE calculation, in accordance with the metric definition. Only the office space energy is excluded from our PUE calculations. Figure 3 shows a simplified power distribution schematic for our data centers.

image

Figure 3: Google Data Center Power Distribution Schematic

Equation for PUE for Our Data Centers

image

  • EUS1 Energy consumption for type 1 unit substations feeding the cooling plant, lighting, and some network equipment
  • EUS2 Energy consumption for type 2 unit substations feeding servers, network, storage, and CRACs
  • ETX Medium and high voltage transformer losses
  • EHV High voltage cable losses
  • ELV Low voltage cable losses
  • ECRAC CRAC energy consumption
  • EUPS Energy loss at UPSes which feed servers, network, and storage equipment
  • ENet1 Network room energy fed from type 1 unit substations
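The equation in the image combines these terms into a total-facility-over-IT ratio.  As a sketch of how that calculation might look – my reading is that facility energy is EUS1 + EUS2, and IT energy is EUS2 less the distribution losses and CRAC energy, plus ENet1 – with made-up energy values:

```python
# Sketch of a PUE calculation from the metered terms listed above.
# How the terms combine is my reading of Google's published equation, and
# the energy values (MWh for the quarter) are made up for illustration.
E_US1 = 1_000.0   # type 1 substations: cooling plant, lighting, some network gear
E_US2 = 10_000.0  # type 2 substations: servers, network, storage, and CRACs
E_TX = 100.0      # medium/high voltage transformer losses
E_HV = 20.0       # high voltage cable losses
E_LV = 80.0       # low voltage cable losses
E_CRAC = 600.0    # CRAC energy consumption
E_UPS = 150.0     # UPS losses feeding servers, network, and storage
E_Net1 = 50.0     # network room energy fed from type 1 substations

total_facility_energy = E_US1 + E_US2
it_energy = E_US2 - E_TX - E_HV - E_LV - E_CRAC - E_UPS + E_Net1
print(f"PUE = {total_facility_energy / it_energy:.2f}")  # 1.21 with these numbers
```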
Error Analysis

To ensure our PUE calculations are accurate, we performed an uncertainty analysis using the root sum of the squares (RSS) method.  Our uncertainty analysis shows that the overall uncertainty in the PUE calculations is less than 2% (99.7% confidence interval).  Our power meters are highly accurate (ANSI C12.20 0.2 compliant) so that measurement errors have a negligible impact on overall PUE uncertainty.  The contribution to the overall uncertainty for each term described above is outlined in the table below.

Term      Overall Contribution to Uncertainty
EUS1      4%
EUS2      9%
ETX       10%
ECRAC     70%
EUPS      <1%
EHV       2%
ELV       5%
ENet1     <1%
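The RSS method mentioned above combines independent error terms as the square root of the sum of their squares, and each term’s share of the total variance gives a contribution breakdown like the table.  A toy sketch with made-up per-term uncertainties (not Google’s figures):

```python
import math

# Toy root-sum-of-squares (RSS) combination of independent metering errors.
# Absolute uncertainties (MWh) below are made up, not Google's figures.
term_uncertainty = {
    "EUS1": 5.0,
    "EUS2": 9.0,
    "ETX": 10.0,
    "ECRAC": 25.0,
    "EUPS": 0.5,
    "EHV": 2.0,
    "ELV": 4.0,
    "ENet1": 0.5,
}

combined = math.sqrt(sum(u ** 2 for u in term_uncertainty.values()))
print(f"combined uncertainty: {combined:.1f} MWh")

# Each term's share of the combined variance, analogous to the
# "contribution to uncertainty" column above.
for term, u in term_uncertainty.items():
    print(f"{term:6s} {u ** 2 / combined ** 2:6.1%}")
```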

Read more