Do you care more about Top Supercomputers in China and the NSA, or Massive Clusters at Google, Facebook, Microsoft, and Amazon?

There is news that China now holds the world record for the fastest supercomputer.

The ten fastest supercomputers on the planet, in pictures

Chinese supercomputer clocks in at 33.86 petaflops to break speed record.

A Chinese supercomputer known as Tianhe-2 was today named the world's fastest machine, nearly doubling the previous speed record with its performance of 33.86 petaflops. Tianhe-2's ascendance was revealed in advance and was made official today with the release of the new Top 500 supercomputer list.

The media will gladly write about who has the biggest and most powerful supercomputer.

As one of my friends who has worked on supercomputer data centers put it, we realized we could cut a lot of costs in the data center, because the supercomputer would often have weekly maintenance intervals as well as monthly and quarterly ones. Components are constantly failing, and yes, there is a degree of isolation in the failures, but eventually you need to repair them, which can mean a complete shutdown. Those shutdowns are when data center maintenance can be performed.

But at Google, Facebook, Microsoft, and Amazon there is no time to shut down services. Hundreds of thousands of servers need to run all the time.

Amazon submitted a supercomputer entry in 2011, and it is still on the list, ranked 127 as of June 2013.

List      Rank   System                                              Vendor      Total Cores   Rmax (TFlops)   Rpeak (TFlops)   Power (kW)
06/2013   127    Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -
11/2012   102    Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -
06/2012   72     Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -
11/2011   42     Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -

Can you imagine if Google, Facebook, Microsoft, or Amazon put up their clusters as an entry?

Part of the advantage companies like Google have is that they have teams of people, led by guys like Jeff Dean, who think really hard about compute clusters. Here is a presentation Dean gave 4 years ago.

[Slides from Jeff Dean's presentation]

Google, Facebook, Microsoft, and Amazon are solving the problem of keeping supercomputer-scale performance running 24x7x365. I think this type of innovation affects us much more than who has the fastest supercomputer, which requires hundreds of hours of downtime for maintenance.

Watch out, water shortages are getting worse and fracking is bidding for the water

Droughts are scattered around the country, and typically agriculture gets first priority. But, as RT.com reports, fracking is causing problems.

Fracking is occurring in several counties in Arkansas, Colorado, New Mexico, Oklahoma, Texas, Utah and Wyoming, which are currently suffering a severe drought, the Associated Press reports. Although the procedure requires less water than farming or overall residential uses, it contributes to the depletion of an already-scarce resource.

Some oil and gas companies manage to drain states of their water supply without spending any money, by depleting underground aquifers or rivers. But when unable to acquire the resource for free, the corporations can purchase large quantities at hefty prices. 

“There is a new player for water, which is oil and gas,” Colorado farmer Kent Peppler told AP, noting that he is fallowing some of his corn fields because he can’t afford to irrigate them. “And certainly they are in a position to pay a whole lot more than we are.”

Peppler, president of the Rocky Mountain Farmers Union, said that the price of water has skyrocketed since oil companies have moved in. The Mead, Colo., farmer said he used to pay $9 to $100 per acre-foot of water at city-held auctions, but that energy companies are now buying the excess supplies for $1,200 to $2,900 per acre-foot.

NPR has a good post on Water Wars.

There are two doctrines that govern surface water rights in the U.S. — one for the West and one for the East.

'A Reasonable Right'

The riparian doctrine covers the East. "[Under] the riparian doctrine, if you live close to the river or to that water body [or] lake, you have reasonable rights to use that water," says Venki Uddameri, a professor and the director of water resources at Texas Tech University.

The Western U.S. uses the prior appropriation doctrine. "As people started exploring the West and started looking for water for agriculture and mining, there was a need to move water away from the rivers," Uddameri tells Jacki Lyden, host of weekends on All Things Considered.

People wanted a claim to water but often lived too far away from a river for the riparian doctrine to make any sense. So the prior appropriation doctrine was devised.

Uddameri explains: "It allocates rights based on who started using the water first. So if you are first in time, you are first in rights. And historically, it was based on a permitting process where you go and say you asked for the permit first, so you became the first user.

"But then there's been a shift saying not first use strictly based on who asked for the permit first, but who was actually there first," he says. "So the Indian tribes who were there first may not have asked for a permit, but there's recognition now that they were the first users of water, so they get that first appropriation."

Very few data center people look at the water rights for their data center, but some of the smartest do. Do you?

In an arid place like the Klamath Basin, there often isn't enough water available for everyone who has a right to use it. And the person with the oldest water right gets all the water they are entitled to first.
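
To make the mechanics concrete, here is a minimal sketch of how prior appropriation allocates a limited supply strictly by seniority; the names, dates, and quantities are hypothetical, and real water law has many more wrinkles.

```python
from dataclasses import dataclass

@dataclass
class WaterRight:
    holder: str
    priority_year: int      # earlier year = more senior right
    entitlement_af: float   # entitlement in acre-feet

def allocate(supply_af: float, rights: list[WaterRight]) -> dict[str, float]:
    """Prior appropriation: the most senior right is filled completely
    before any junior right receives a drop."""
    allocations = {}
    remaining = supply_af
    for right in sorted(rights, key=lambda r: r.priority_year):
        granted = min(right.entitlement_af, remaining)
        allocations[right.holder] = granted
        remaining -= granted
    return allocations

# Hypothetical dry year with only 1,500 acre-feet available.
rights = [
    WaterRight("Tribal right", 1855, 1000),
    WaterRight("Farm A", 1902, 800),
    WaterRight("Gas company", 2011, 600),
]
print(allocate(1500, rights))
# {'Tribal right': 1000, 'Farm A': 500, 'Gas company': 0}
```

The junior right holder gets nothing in a shortage, which is exactly why a new, well-funded bidder like oil and gas changes the market for everyone else.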

Asset Management is broken almost everywhere

A couple of years ago I went to an IT asset management conference, thinking I would learn how asset management worked in the industry. What I quickly realized is that something was amiss. Almost none of these people were even close to being technical. Talking to some of them and watching the presentations, I eventually discovered that the vast majority of the approach was targeted at the person who had been handed the job of asset management. And this person was given no tools, no budget, and limited staff. They had Excel-spreadsheet-style databases, and they didn't really understand the assets themselves or how they interact. Very few system architects understand how the hardware and software work together, so expecting a bookkeeper type of person to understand an asset is pretty much impossible.

I was reading Chris Crosby's post on gov't in action.

Good news. Three years into their five-year data center consolidation project the federal government just found out they had 3,000 more data centers than they originally thought. Boy, you just can’t slide anything past these guys. As a side note, this should make all of you worried about the NSA tracking your phone calls feel better, since if it takes 3 years to find 3,000 buildings, odds are they aren’t going to find out about your weekly communiqués to Aunt Marge–or even “Uncle” Bookie—any time soon. Obviously, this new discovery is going to have some impact on the project, but I think the first question we have to ask is just where have these data centers been hiding all this time?

If you want more details on the Federal Gov't's discovery.

The one group that is pointed out as being on top of its inventory is DHS.

The Department of Homeland Security is one of the few agencies that seems to have a handle on its data center inventory. Powner and Coburn praised DHS as the gold standard for data center consolidation because the agency has successfully tracked its data center inventory, how many have been consolidated and how much has been saved by consolidating facilities.

The DHS is one of the newest and best-funded gov't agencies.

In fiscal year 2011, DHS was allocated a budget of $98.8 billion and spent, net, $66.4 billion.

So there are groups who do have a handle on their data center inventory, but they tend to have the benefit of bigger budgets and newer organizations than the rest.

There is something fundamentally wrong with the way asset management is done.  And once you see why it is broken, then you can start to figure out how to fix it.

An example of the wrong approach: the person in charge of asset management has a long list of items to track for an asset, creating a long form that almost no one understands, including themselves. Many times that long form is then used in data entry and reports, and it affects the performance of the database. What is missing is an overall way to reconcile the errors entered into the system. What kind of errors? A simple thing like the location of the asset. Someone enters the location of an asset into the long form. Data accepted, move on to the next boring data entry. When errors are made, they accumulate, because there is no regular reconciliation method to catch them. Annual audits are performed, but the audits themselves have errors.
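
To illustrate what the missing reconciliation step could look like, here is a minimal sketch that diffs the asset database against a discovery scan of what is actually on the floor; the record layout and IDs are hypothetical, not any particular product's schema.

```python
def reconcile(recorded: dict[str, dict], observed: dict[str, dict]) -> list[str]:
    """Return human-readable discrepancies between the asset DB and a scan."""
    issues = []
    for asset_id, record in recorded.items():
        seen = observed.get(asset_id)
        if seen is None:
            issues.append(f"{asset_id}: in the database but not found on the floor")
        elif seen["location"] != record["location"]:
            issues.append(
                f"{asset_id}: database says {record['location']}, "
                f"scan found it in {seen['location']}"
            )
    # Assets the scan found that the database has never heard of.
    for asset_id in observed.keys() - recorded.keys():
        issues.append(f"{asset_id}: on the floor but missing from the database")
    return issues

recorded = {"srv-001": {"location": "Rack A1"}, "srv-002": {"location": "Rack B3"}}
observed = {"srv-001": {"location": "Rack A1"}, "srv-003": {"location": "Rack C7"}}
print("\n".join(reconcile(recorded, observed)))
```

Run something like this continuously against live discovery data and errors get caught the day they are entered, instead of piling up until the annual audit.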

Most people care about only a very limited amount of information about an asset, if they care at all. Yet filling out an asset management form can be one of the most tedious, mind-numbing tasks in IT. Who wants to do that? Oh, there is someone in finance, purchasing, or operations who has that responsibility. Tell them they have the asset management job.

Being asset manager is a thankless job at most companies. Yet how much could be done if you really knew, with 100% accuracy, when assets were installed, brought online, and repaired, and how their physical presence affected the overall performance of the system? Asset management is really important to a few companies. Who? I would say that Google understands it best, as they are one of the youngest and have the most servers. Also, since they build their own servers, they need to understand every asset. They do asset management down to the components, something almost no one does. Well, Facebook manages the server components as well, because they have Frank Frankovsky, who is ex-Dell and understands logistics.

What both Google and Facebook understand is that IT asset management is part of the overall logistics of delivering IT services. How can you not know with near 100% accuracy what you have and how it performs?
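
As a toy illustration of component-level tracking, here is a sketch of an asset record as little more than a per-component event log; the fields, names, and dates are hypothetical, not Google's or Facebook's actual systems.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LifecycleEvent:
    when: date
    what: str          # "installed", "online", "repaired", "decommissioned"
    note: str = ""

@dataclass
class Component:
    serial: str
    kind: str                                   # "DIMM", "drive", "NIC", ...
    events: list[LifecycleEvent] = field(default_factory=list)

@dataclass
class Server:
    asset_tag: str
    rack: str
    components: list[Component] = field(default_factory=list)

# Hypothetical example: track a DIMM swap at the component level.
dimm = Component("DIMM-48213", "DIMM", [LifecycleEvent(date(2012, 3, 1), "installed")])
server = Server("srv-0042", "Rack A1", [dimm])
dimm.events.append(LifecycleEvent(date(2013, 6, 10), "repaired", "replaced after ECC errors"))

for comp in server.components:
    for ev in comp.events:
        print(server.asset_tag, comp.serial, ev.when, ev.what, ev.note)
```

Once the history lives at the component level, questions like "which memory lot is failing early" or "how long was this box offline for repairs" become queries instead of guesses.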

Lessons from a Successful Brewery for Data Center Operations, Lean Principles

I was catching up with a data center operations executive, drinking some good Black Raven beer poured from a Deschutes Brewery growler into Deschutes imperial pints.

[Photos: Black Raven beer, Deschutes Brewery growler, and imperial pint glasses]

My son and I visited the Deschutes brewery and on the wall is a plaque with the following Lean Principles.

Lean Principles

- Do our best and next time do it better

- Focus on processes and systems as well as results

- Problems are opportunities

- Standardization allows creativity

- All work leads to damn tasty beer

Lean Rules

- Don’t take yourself too seriously

I don't think I've seen a sign like this in a data center, but I have seen beer. :-)

Google's Server Environment is not as homogeneous as you think, up to 5 microarchitectures

There is a common belief that Google, Facebook, Twitter, and the newer Web 2.0 companies have it easier because they have homogeneous environments vs. a typical enterprise. Well, Google has a paper that discusses how its homogeneous warehouse-scale computers are actually heterogeneous, and that there is an opportunity for performance improvements of up to 15%.

In this table Google lists the number of microarchitectures in 10 different data centers. Now, Google has 13 WSCs, so this could show how old this analysis is (maybe 2-3 yrs ago). Or it could have been run more recently and they dropped 3 data centers out of the table. The 13th just came online over the past year and would probably not have enough data.

[Table: number of microarchitectures in each of 10 Google WSCs]

The issue that is pointed out in the paper is that the job manager assumes the cores are homogeneous.

[Figure: the cluster manager's homogeneous view of its machines]

When in fact they are not.

[Figure: the actual mix of microarchitectures in the WSC]
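
To make the idea concrete, here is a rough sketch of heterogeneity-aware placement in the spirit of the paper; this is not the actual Whare-Map algorithm, and the job names, platform names, and scores are hypothetical.

```python
# Observed performance score for each (job, microarchitecture) pairing,
# e.g. normalized throughput. A heterogeneity-oblivious scheduler ignores this.
perf_score = {
    ("websearch", "Xeon-E5530"): 1.00,
    ("websearch", "Opteron-2435"): 0.78,
    ("ads",       "Xeon-E5530"): 0.85,
    ("ads",       "Opteron-2435"): 0.95,
}

free_slots = {"Xeon-E5530": 1, "Opteron-2435": 1}

def place(jobs, free_slots, perf_score):
    """Greedy heterogeneity-aware placement: take the best-scoring
    (job, platform) pairs first instead of treating all slots as identical."""
    pairs = sorted(
        ((job, plat, score) for (job, plat), score in perf_score.items() if job in jobs),
        key=lambda t: t[2],
        reverse=True,
    )
    assignment = {}
    for job, plat, score in pairs:
        if job in assignment or free_slots.get(plat, 0) == 0:
            continue
        assignment[job] = plat
        free_slots[plat] -= 1
    return assignment

print(place(["websearch", "ads"], dict(free_slots), perf_score))
# -> {'websearch': 'Xeon-E5530', 'ads': 'Opteron-2435'}
```

The point is simply that once the scheduler has per-platform scores, a slot on one microarchitecture is no longer interchangeable with a slot on another.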

Here is the results summary.

Results Summary: This paper shows that there is a significant performance opportunity when taking advantage of emergent heterogeneity in modern WSCs. At the scale of modern cloud infrastructures such as those used by companies like Google, Apple, and Microsoft, gaining just 1% of performance improvement for a single application translates to millions of dollars saved. In this work, we show that large-scale web-service applications that are sensitive to emergent heterogeneity improve by more than 80% when employing Whare-Map over heterogeneity-oblivious mapping. When evaluating Whare-Map using our testbed composed of key Google applications running on three types of production machines commonly found co-existing in the same WSC, we improve the overall performance of an entire WSC by 18%. We also find a similar improvement of 15% in our benchmark testbed and in our analysis of production data from WSCs hosting live services.

Here are the three different microarchitectures used in the paper: Table 3 is production, Table 4 is a test bed.

[Tables 3 and 4: microarchitectures in production and in the test bed]

Here is the range in performance for the three different microarchitectures.

[Figure: performance range across the three microarchitectures]

The new job scheduler is deployed at Google, and here are the results.

[Figure 11: Whare-Map improvement across 10 of Google's active WSCs]

Figure 11 shows the calculated performance improvement when using Whare-Map over the currently deployed mapping in 10 of Google’s active WSCs. Even though some major applications are already mapped to their best platforms through manual assignment, we have measured significant potential improvement of up to 15% when intelligently placing the remaining jobs. This performance opportunity calculation based on this paper is now an integral part of Google’s WSC monitoring infrastructure. Each day the number of ‘wasted cycles’ due to inefficiently mapping jobs to the WSC is calculated and reported across each of Google’s WSCs worldwide.
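
My reading of that last point, sketched loosely in code (this is my interpretation, not Google's monitoring code, and the numbers are made up): "wasted cycles" is the gap between the cycles jobs actually burned and what they would have needed on their best-known platform.

```python
def wasted_cycles(jobs):
    """jobs: list of dicts with the cycles a job actually consumed and the
    speedup it would see on its best-known platform (1.0 = already on best)."""
    total = 0.0
    for job in jobs:
        # Cycles the job would have needed if it ran on its best platform.
        ideal = job["actual_cycles"] / job["speedup_on_best"]
        total += job["actual_cycles"] - ideal
    return total

daily_jobs = [
    {"name": "websearch", "actual_cycles": 9.0e12, "speedup_on_best": 1.00},  # already on best platform
    {"name": "ads",       "actual_cycles": 4.0e12, "speedup_on_best": 1.15},  # 15% faster elsewhere
]
print(f"wasted cycles today: {wasted_cycles(daily_jobs):.3e}")
```

Summing that gap across every WSC every day turns "we should place jobs better" into a number an operations team can watch and drive down.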

There is more in the paper I need to digest, but I need to finish this post as it is long enough already.