Won't be blogging much this week, focused on listening, learning and networking

I am at GigaOm Structure, and I find it is really hard to listen, learn, network and blog at the same time.  I can time-shift the blogging to later, so I am going to focus on listening to the presentations, networking like crazy, and learning as much as I can.

Here is a sample of what is covered at GigaOm Structure.

See inside Facebook’s network & explore Google’s data dreams at Structure

JUN. 17, 2013 - 6:00 AM PDT

SUMMARY:

Infrastructure nerds, it’s time to meet the accountants. At this year’s Structure conference this Wednesday and Thursday we’re focusing on the economics of cloud computing, not just for vendors, but for practitioners.

Want to understand how Facebook connects its servers? Hear from VMware’s CEO how the virtualization giant plans to build its next big business? Discover why Snapchat builds on Google App Engine as opposed to Amazon Web Services? Or maybe you want to understand if Microsoft can compete in the cloud.

Google publishes ideas discussing a "good enough" approach to achieving low latency

It can be really hard to get the media to publish complex concepts, which is why companies will publish their own articles.  Google's Luiz Barroso and Jeff Dean have an article on Google's data center challenge of providing low-latency performance at scale.


The Tail at Scale

Systems that respond to user actions quickly (within 100ms) feel more fluid and natural to users than those that take longer. Improvements in Internet connectivity and the rise of warehouse-scale computing systems have enabled Web services that provide fluid responsiveness while consulting multi-terabyte datasets spanning thousands of servers; for example, the Google search system updates query results interactively as the user types, predicting the most likely query based on the prefix typed so far, performing the search and showing the results within a few tens of milliseconds. Emerging augmented-reality devices (such as the Google Glass prototype) will need associated Web services with even greater responsiveness in order to guarantee seamless interactivity.

The article may be too long for most readers, so here are two key points.

In large information-retrieval (IR) systems, speed is more than a performance metric; it is a key quality metric, as returning good results quickly is better than returning the best results slowly. Two techniques apply to such systems, as well as to other systems that inherently deal with imprecise results:

Good enough. In large IR systems, once a sufficient fraction of all the leaf servers has responded, the user may be best served by being given slightly incomplete ("good-enough") results in exchange for better end-to-end latency. The chance that a particular leaf server has the best result for the query is less than one in 1,000 queries, odds further reduced by replicating the most important documents in the corpus into multiple leaf servers. Since waiting for exceedingly slow servers might stretch service latency to unacceptable levels, Google's IR systems are tuned to occasionally respond with good-enough results when an acceptable fraction of the overall corpus has been searched, while being careful to ensure good-enough results remain rare. In general, good-enough schemes are also used to skip nonessential subsystems to improve responsiveness; for example, results from ads or spelling-correction systems are easily skipped for Web searches if they do not respond in time.
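
To make the quote concrete, here is a minimal sketch of the pattern, assuming a simple threaded fan-out.  It is my own illustration, not Google's code, and the leaf_search stub, the 90% threshold, and the 50 ms budget are all hypothetical.

```python
import concurrent.futures
import random
import time

def leaf_search(shard_id, query):
    """Hypothetical stand-in for an RPC to one leaf/index shard."""
    time.sleep(random.uniform(0.005, 0.120))  # simulate variable shard latency
    return ["%s: result from shard %d" % (query, shard_id)]

def good_enough_search(query, num_shards=20, min_fraction=0.9, budget_s=0.05):
    """Fan out to every shard, but return once min_fraction of them have
    answered or the latency budget expires, whichever comes first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=num_shards)
    futures = [pool.submit(leaf_search, i, query) for i in range(num_shards)]
    results, answered = [], 0
    try:
        for future in concurrent.futures.as_completed(futures, timeout=budget_s):
            results.extend(future.result())
            answered += 1
            if answered / num_shards >= min_fraction:
                break  # "good enough": stop waiting for the slowest shards
    except concurrent.futures.TimeoutError:
        pass  # budget expired; serve the partial, good-enough result set
    pool.shutdown(wait=False)  # do not block on the stragglers
    return results, answered / num_shards

if __name__ == "__main__":
    hits, coverage = good_enough_search("warehouse scale computing")
    print("returned %d results covering %.0f%% of shards" % (len(hits), 100 * coverage))
```

The point is simply that the response is assembled from whichever shards answered in time, trading a tiny amount of result completeness for a much tighter latency bound.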

Google also uses a technique that is like sticking your toe in the water to test the environment before jumping in.  They call it a canary request.

Canary requests. Another problem that can occur in systems with very high fan-out is that a particular request exercises an untested code path, causing crashes or extremely long delays on thousands of servers simultaneously. To prevent such correlated crash scenarios, some of Google's IR systems employ a technique called "canary requests"; rather than initially send a request to thousands of leaf servers, a root server sends it first to one or two leaf servers. The remaining servers are only queried if the root gets a successful response from the canary in a reasonable period of time. If the server crashes or hangs while the canary request is outstanding, the system flags the request as potentially dangerous and prevents further execution by not sending it to the remaining leaf servers. Canary requests provide a measure of robustness to back-ends in the face of difficult-to-predict programming errors, as well as malicious denial-of-service attacks.

The canary-request phase adds only a small amount of overall latency because the system must wait for only a single server to respond, producing much less variability than if it had to wait for all servers to respond for large fan-out requests; compare the first and last rows in Table 1. Despite the slight increase in latency caused by canary requests, such requests tend to be used for every request in all of Google's large fan-out search systems due to the additional safety they provide.
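
As a rough illustration of how a canary phase could sit in front of a fan-out like the one sketched above, here is another hedged sketch.  Again, this is my own toy code, not Google's implementation, and the two-canary count and 200 ms canary deadline are made-up parameters.

```python
import concurrent.futures
import random
import time

class PotentiallyDangerousRequest(Exception):
    """Raised when no canary server answers successfully in time."""

def canary_fanout(pool, leaf_call, query, num_shards, num_canaries=2, canary_timeout_s=0.2):
    """Send the query to one or two canary shards first; only fan out to the
    remaining shards after a canary responds successfully."""
    canaries = [pool.submit(leaf_call, shard, query) for shard in range(num_canaries)]
    done, _ = concurrent.futures.wait(
        canaries, timeout=canary_timeout_s,
        return_when=concurrent.futures.FIRST_COMPLETED)
    if not any(f.exception() is None for f in done):
        # Canary crashed, errored, or timed out: flag the request and do not
        # forward it to the remaining leaf servers.
        raise PotentiallyDangerousRequest(query)
    rest = [pool.submit(leaf_call, shard, query)
            for shard in range(num_canaries, num_shards)]
    return canaries + rest

if __name__ == "__main__":
    def dummy_leaf(shard_id, query):
        time.sleep(random.uniform(0.01, 0.05))
        return "%s: shard %d" % (query, shard_id)

    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        futures = canary_fanout(pool, dummy_leaf, "warehouse scale", num_shards=16)
        print([f.result() for f in futures])
```

Only after the canary comes back cleanly does the request touch the rest of the fleet, which is what keeps an untested code path from taking down thousands of servers at once.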

Do you care more about top supercomputers in China and the NSA, or massive clusters at Google, Facebook, Microsoft, and Amazon?

There is news that China holds the world record for the fastest supercomputer.

The ten fastest supercomputers on the planet, in pictures

Chinese supercomputer clocks in at 33.86 petaflops to break speed record.

A Chinese supercomputer known as Tianhe-2 was today named the world's fastest machine, nearly doubling the previous speed record with its performance of 33.86 petaflops. Tianhe-2's ascendance was revealed in advance and was made official today with the release of the new Top 500 supercomputer list.

The media will gladly write about who has the biggest and most powerful supercomputer.

As one of my friends who has worked on supercomputer data centers said, we realized we could reduce a lot of costs in the data center, because the supercomputer would often have weekly maintenance intervals as well as monthly and quarterly ones.  Components are constantly failing, and yes, there is a degree of isolation in the failures, but you eventually need to repair them, which can mean a complete shutdown.  These shutdowns are when data center maintenance can be performed.

But at Google, Facebook, Microsoft, and Amazon there is no time to shut down services.  Hundreds of thousands of servers need to run all the time.  

Amazon threw up a supercomputer entry in 2011, and it is still on the list, now ranked 127.

List      Rank   System                                              Vendor      Total Cores   Rmax (TFlops)   Rpeak (TFlops)   Power (kW)
06/2013   127    Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -
11/2012   102    Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -
06/2012   72     Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -
11/2011   42     Amazon EC2 Cluster, Xeon 8C 2.60GHz, 10G Ethernet   Self-made   17,024        240.1           354.1            -

Can you imagine if Google, Facebook, Microsoft, or Amazon put up their clusters as an entry?

Part of the advantage companies like Google have is teams of people, led by guys like Jeff Dean, who really think hard about compute clusters.  Here is a presentation Dean gave four years ago.

[Slides from Jeff Dean's presentation]

Google, Facebook, Microsoft, and Amazon are solving the problem of keeping supercomputer-class performance running 24x7x365.  I think this type of innovation affects us much more than who has the fastest supercomputer, which requires hundreds of hours of downtime a year for maintenance.  

Watch out, water shortages are getting worse and fracking is bidding for the water

Droughts are scattered around, and typically agriculture gets first priority.  But, as RT.com reports, fracking is causing problems.

Fracking is occurring in several counties in Arkansas, Colorado, New Mexico, Oklahoma, Texas, Utah and Wyoming, which are currently suffering a severe drought, the Associated Press reports. Although the procedure requires less water than farming or overall residential uses, it contributes to the depletion of an already-scarce resource.

Some oil and gas companies manage to drain states of their water supply without spending any money, by depleting underground aquifers or rivers. But when unable to acquire the resource for free, the corporations can purchase large quantities at hefty prices. 

“There is a new player for water, which is oil and gas,” Colorado farmer Kent Peppler told AP, noting that he is fallowing some of his corn fields because he can’t afford to irrigate them. “And certainly they are in a position to pay a whole lot more than we are.”

Peppler, president of the Rocky Mountain Farmers Union, said that the price of water has skyrocketed since oil companies have moved in. The Meade, Colo., farmer said he used to pay $9 to $100 per acre-foot of water at city-held auctions, but that energy companies are now buying the excess supplies for $1,200 to $2,900 per acre-foot.

NPR has a good post on Water Wars.

There are two doctrines that govern surface water rights in the U.S. — one for the West and one for the East.

'A Reasonable Right'

The riparian doctrine covers the East. "[Under] the riparian doctrine, if you live close to the river or to that water body [or] lake, you have reasonable rights to use that water," says Venki Uddameri, a professor and the director of water resources at Texas Tech University.

The Western U.S. uses the prior appropriation doctrine. "As people started exploring the West and started looking for water for agriculture and mining, there was a need to move water away from the rivers," Uddameri tells Jacki Lyden, host of weekends on All Things Considered.

People wanted a claim to water but often lived too far away from a river for the riparian doctrine to make any sense. So the prior appropriation doctrine was devised.

Uddameri explains: "It allocates rights based on who started using the water first. So if you are first in time, you are first in rights. And historically, it was based on a permitting process where you go and say you asked for the permit first, so you became the first user.

"But then there's been a shift saying not first use strictly based on who asked for the permit first, but who was actually there first," he says. "So the Indian tribes who were there first may not have asked for a permit, but there's recognition now that they were the first users of water, so they get that first appropriation."

Very few data center people look at the water rights for their data center, but some of the smartest do.  Do you?

In an arid place like the Klamath Basin, there often isn't enough water available for everyone who has a right to use it. And the person with the oldest water right gets all the water they are entitled to first.

Asset Management is broken almost everywhere

A couple of years ago I went to an IT Asset Management conference, thinking I would learn how asset management worked in the industry.  What I quickly realized is that something was amiss.  Almost all of these people were not even close to being technical.  Talking to some of the people and watching the presentations, I eventually discovered that the vast majority of the approach was targeted at the person who eventually was given the job of asset management.  And this person was given no tools, no budget, and limited staff.  They had Excel-spreadsheet-style databases, and they didn't really understand the assets themselves or how they interact.  Very few system architects understand how the hardware and software work together, so expecting a bookkeeper type of person to understand an asset is pretty much impossible.

I was reading Chris Crosby's post on gov't in action.

Good news. Three years into their five-year data center consolidation project the federal government just found out they had 3,000 more data centers than they originally thought. Boy, you just can’t slide anything past these guys. As a side note, this should make all of you worried about the NSA tracking your phone calls feel better, since if it takes 3 years to find 3,000 buildings, odds are they aren’t going to find out about your weekly communiqués to Aunt Marge–or even “Uncle” Bookie—any time soon. Obviously, this new discovery is going to have some impact on the project, but I think the first question we have to ask is just where have these data centers been hiding all this time?

There are more details available on the Federal Gov't's discovery if you want them.

The one group that is pointed out as being on top of its inventory is DHS.

The Department of Homeland Security is one of the few agencies that seems to have a handle on its data center inventory. Powner and Coburn praised DHS as the gold standard for data center consolidation because the agency has successfully tracked its data center inventory, how many have been consolidated and how much has been saved by consolidating facilities.

The DHS is one of the newest and best-funded gov't agencies.

In fiscal year 2011, DHS was allocated a budget of $98.8 billion and spent, net, $66.4 billion.

So there are groups that do have a handle on their data center inventory, but many of them have the benefit of big budgets and newer organizations than the rest.

There is something fundamentally wrong with the way asset management is done.  And once you see why it is broken, then you can start to figure out how to fix it.

An example of the wrong approach: the person in charge of asset management has a long list of items to track for an asset, creating a long form that almost no one understands, including themselves. Many times that long form is then used in data entry and reports, and it affects the performance of the database.  What is missing is an overall way to reconcile the errors entered into the system.  What kind of errors? A simple thing like the location of the asset. Someone enters the location of an asset into the long form.  Data accepted, move on to the next boring data entry.  When errors are made, they accumulate, as there is a lack of regular reconciliation methods to catch them.  Annual audits are performed, but the audits themselves have errors.  A sketch of what a regular reconciliation check could look like follows below.
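
Here is a toy sketch of that missing reconciliation step, assuming the asset database can be compared against an independent observation such as a network discovery or rack scan.  The field names and data sources are hypothetical, not any particular DCIM product.

```python
from dataclasses import dataclass

@dataclass
class AssetRecord:
    asset_tag: str
    serial: str
    location: str  # e.g. "DC1 / Row 4 / Rack 12"

def reconcile(recorded, observed):
    """Compare the asset database (recorded) against an independent scan
    (observed) and return the discrepancies instead of waiting for an audit."""
    recorded_by_serial = {a.serial: a for a in recorded}
    issues = []
    for seen in observed:
        on_record = recorded_by_serial.pop(seen.serial, None)
        if on_record is None:
            issues.append("UNRECORDED: serial %s seen at %s" % (seen.serial, seen.location))
        elif on_record.location != seen.location:
            issues.append("MOVED: %s recorded at %s, seen at %s"
                          % (seen.asset_tag, on_record.location, seen.location))
    for missing in recorded_by_serial.values():
        issues.append("MISSING: %s recorded at %s, not seen in the scan"
                      % (missing.asset_tag, missing.location))
    return issues

if __name__ == "__main__":
    database = [AssetRecord("SRV-001", "SN-AAA", "DC1 / Row 4 / Rack 12"),
                AssetRecord("SRV-002", "SN-BBB", "DC1 / Row 4 / Rack 13")]
    scan = [AssetRecord("SRV-001", "SN-AAA", "DC1 / Row 5 / Rack 02")]
    for issue in reconcile(database, scan):
        print(issue)
```

Run on a schedule instead of annually, a check like this catches location errors while they are still easy to explain, which is the part the long data-entry form never addresses.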

Most people care about only a very limited amount of information about an asset, if they care at all.  Yet filling out an asset management form can be one of the most tedious, mind-numbing tasks in IT.  Who wants to do that?  Oh, there is someone in finance, purchasing, or operations who has that responsibility.  Tell them they have the asset management job.

Being asset manager is a thankless job at most companies.  Yet how much could be done if you really knew, with 100% accuracy, when assets were installed, brought online, and repaired, and how their physical presence affected the overall performance of the system?  Asset management is really important to a few companies.  Who?  I would say that Google understands it best, as they are one of the youngest and have the most servers.  Also, since they build their own servers, they need to understand every asset.  They do asset management down to the components, something almost no one else does.  Well, Facebook manages the server components as well, because they have Frank Frankovsky, who is ex-Dell and understands logistics.

What both Google and Facebook understand is that IT asset management is part of the overall logistics of delivering IT services.  How can you not know with near 100% accuracy what you have and how it performs?