Simplicity and the Data Center: a Path to a Happier Data Center?

One way I gauge how good a data center designer is: do they talk about simplicity in the data center?  I can think of people at Google and Microsoft who regularly use simplicity as a design goal, and there are many others.  Why simplicity is important is articulated well in this post by Matthieu Ricard, who discusses simplicity as an approach to life.  For many companies, data centers are their life: if the data centers suffer, the company suffers.

In praise of simplicity

Friday 27 March 2009

« Simplify, simplify, simplify… » These refreshing words written by Henry Thoreau remind us that much of our suffering comes from adding unnecessary and disturbing complications in our lives. We seem to be continually weaving elaborate conceptual webs around even straightforward events. We distort reality and shroud it with complications by superimposing fabricated mental constructs. This distortion invariably leads to mental states and behaviors that undermine our inner peace and that of others.

How many human enterprises and noble causes have failed due to such unnecessary complications! We need to simplify our thoughts, simplify our words, and simplify our actions. We need to avoid falling into circular mental rumination, pointless chatter, and vain activities that waste our precious time and engender all kinds of dysfunctional situations.

Having a simple mind is not the same as being simple-minded. Simplicity of mind is reflected in lucidity, inner strength, buoyancy, and a healthy contentment that withstands the tribulations of life with a light heart. Simplicity reveals the nature of the mind behind the veil of restless thoughts. It reduces the exacerbated feeling of self-importance and opens our heart to genuine altruism.

Who is Matthieu Ricard?  A really smart guy who earned his Ph.D. in cell genetics at the renowned Institut Pasteur under Nobel Laureate Francois Jacob, but figured out he wanted to do more with his life and decided to become a Buddhist monk.  He now spends a lot of time thinking about ways to live a happier life, and maybe there are things to learn from him about how there could be better data centers.

Since 1989, Matthieu has served as the French interpreter for the Dalai Lama. He is a board member of the Mind and Life Institute, an organization dedicated to collaborative research between scientists and Buddhist scholars and meditators. He is engaged in the research on the effect of mind training and meditation on the brain at various universities in the USA (Madison, Princeton, and Berkeley), Europe (Zurich) and Hong Kong.

For an entertaining talk, watch this video.  One of the funny parts is when he pokes fun at his fellow French and at intellectuals at the 1:40 mark.

He has figured out how to be the happiest person in the world.

He has been dubbed the "happiest person in the world" by scientists.[2] Matthieu Ricard was a volunteer subject in the University of Wisconsin–Madison's testing of happiness, scoring -0.45 which was off the scale compared to hundreds of other volunteers, where scores ranged between +0.3 indicating depression and -0.3 denoting great happiness.[3]

Another way to interpret the need for simplicity is the desire for cloud computing.  This post by Joe McKendrick on ZDNet references material written for Database Trends and Applications.

Paradox 5: Complexity Increases Simplicity. “There is pressure on data centers to provide more services, scalability and availability than ever before. That’s why cloud computing approaches are gaining in popularity—companies can ramp up capabilities by hiding away the complexity. “We do not see the concept of the data center disappearing, instead, we see the concept of data centers becoming more amorphous,” says Martin Schneider, director of product marketing at SugarCRM. “The emerging trend of cloud computing kind of ties all of the major trends around data centers, in that it enables companies to run far simpler data centers, if not obviating the need for them in some instances.”

Do you think of simplicity in your data center design?  Or are you one of those who believes adding another feature will solve your data center problems?

Maybe we need a happiness metric for data centers?  I bet there are plenty of data centers we could add to the list of suffering data centers.  How many are happy?


OpenSolaris Green Home Server – low power and small

Sun employee Constantin Gonzalez Schmitz has a post on his technical decisions for a green OpenSolaris home server. His requirements for ECC memory and power efficiency make sense for a reliable low-power server.

A Small and Energy-Efficient OpenSolaris Home Server

In an earlier entry, I outlined my most important requirements for an optimal OpenSolaris Home Server. It should:

  1. Run OpenSolaris in order to fully leverage ZFS,
  2. Support ECC memory, so data is protected at all times,
  3. Be power-efficient, to help the environment and control costs,
  4. Use a moderate amount of space and be quiet, for some extra WAF points.

He admits his wife works for AMD, but qualifies his decision for an AMD processor based on price, performance, and energy efficiency.

Disclosure: My wife works for AMD, so I may be slightly biased. But I think the following points are still very valid.

AMD on the other hand has a number of attractive points for the home server builder:

  • AMD consumer CPUs use the same microarchitecture as their professional CPUs (currently, it's the K10 design). They only vary by number of cores, cache size, number of HT channels, TDP and frequency, which are all results of the manufacturing process. All other microarchitecture features are the same. When using an AMD consumer CPU, you essentially get a "smaller brother" of their high end CPUs.
  • This means you'll also get a built-in memory-controller that supports ECC.
  • This also means less chips to build a system (no Northbridge needed) and thus lower power-consumption.
  • AMD has been using the HyperTransport Interconnect for quite a while now. This is a fast, scaleable interconnect technology that has been on the market for quite a while so chipsets are widely available, proven and low-cost.

So it was no surprise that even low-cost AMD motherboards at EUR 60 or below are perfectly capable of supporting ECC memory, which gives you an important server feature at economic cost.

My platform conclusion: Due to ECC support, low power consumption and good HyperTransport performance at low cost, AMD is an excellent platform for building a home server.

To keep things small he uses 2.5” drives.

While looking for alternatives, I found a nice solution: The Scythe Slot Rafter fits into an unused PCI slot (taking up the breadth of two) and provides space for mounting four 2.5" disks at just EUR 5. These disks are cheap, good enough and I had an unused one lying around anyway, so that was a perfect solution for me.

And, being concerned about reliability, he adds a second NIC.

Extra NIC: The Asus M3A78-CM comes with a Realtek NIC and some people complained about driver issues with OpenSolaris. So I followed the advice on the aforementioned Email thread and bought an Intel NIC which is well supported, just in case.

Constantin was able to achieve a 45W idle power consumption.

The Result

And now for the most important part: How much power does the system consume? I did some testing with one boot disk and 4GB of ECC RAM and measured about 45W idle. While stressing CPU cores, RAM and the disk with multiple instances of sysbench, I could not get the system to consume more than 80W. All in all, I'm very pleased with the numbers, which are about half of what my old system used to consume. I didn't do any detailed performance tests yet, but I can say that the system feels very responsive and compile runs just rush along the screen. CPU temperature won't go beyond the low 50Cs on a hot day, despite using the lowest fan speed, so cooling seems to work well, too.
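Constantin's measured figures make for an easy back-of-the-envelope cost check. A quick sketch in Python, using the 45W idle and 80W load numbers from the post (the electricity price is my assumption, purely for illustration):

```python
# Annual energy and rough cost for the home server, using the measured
# figures from the post: ~45 W idle, ~80 W under full load.
# The electricity price is an assumed value for illustration only.

IDLE_W = 45.0
LOAD_W = 80.0
PRICE_PER_KWH_EUR = 0.20  # assumption; adjust for your local tariff

HOURS_PER_YEAR = 24 * 365

def annual_kwh(watts):
    """Convert a constant draw in watts to kWh per year."""
    return watts * HOURS_PER_YEAR / 1000.0

idle_kwh = annual_kwh(IDLE_W)
load_kwh = annual_kwh(LOAD_W)
print(f"Idle: {idle_kwh:.0f} kWh/year, ~EUR {idle_kwh * PRICE_PER_KWH_EUR:.0f}")
print(f"Load: {load_kwh:.0f} kWh/year, ~EUR {load_kwh * PRICE_PER_KWH_EUR:.0f}")
```

At idle that works out to roughly 394 kWh per year, so halving the old system's draw is a meaningful saving on an always-on box.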

It will be interesting to see what follow up posts Constantin writes.


Google’s Secret to efficient Data Center design – ability to predict performance

DataCenterKnowledge has a post on Google’s future, envisioning 10 million servers.

Google Envisions 10 Million Servers

October 20th, 2009 : Rich Miller

Google never says how many servers are running in its data centers. But a recent presentation by a Google engineer shows that the company is preparing to manage as many as 10 million servers in the future.

Google’s Jeff Dean was one of the keynote speakers at an ACM workshop on large-scale computing systems, and discussed some of the technical details of the company’s mighty infrastructure, which is spread across dozens of data centers around the world.

In his presentation (link via James Hamilton), Dean also discussed a new storage and computation system called Spanner, which will seek to automate management of Google services across multiple data centers. That includes automated allocation of resources across “entire fleets of machines.”

Digging into Jeff Dean’s presentation, I found a Google secret.

image

Designs, Lessons and Advice from Building Large Distributed Systems

Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design
– without actually having to build it!

What is Google’s assumption of where computing is going?

image

Thinking like an information factory, Google describes the machinery as servers, racks, and clusters.  This approach supports the idea of information production.  Google introduces the idea of the data center as a computer, but I find a more accurate analogy is to think of data centers as information factories.  The IT equipment is the machinery in the factory, consuming large amounts of electricity to power and cool the IT load.

 image

Located in a data center like the one in The Dalles, OR

image

With all that equipment things must break.  And, yes they do.

Reliability & Availability
• Things will crash. Deal with it!
– Assume you could start with super reliable servers (MTBF of 30 years)
– Build computing system with 10 thousand of those
– Watch one fail per day
• Fault-tolerant software is inevitable
• Typical yearly flakiness metrics
– 1-5% of your disk drives will die
– Servers will crash at least twice (2-4% failure rate)
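The slide's arithmetic checks out: a fleet of 10,000 servers, each with a 30-year MTBF, really does produce roughly one failure per day. A quick sketch:

```python
# Sanity-check the slide's claim: 10,000 super-reliable servers
# (MTBF of 30 years) still fail at a rate of about one per day.

SERVERS = 10_000
MTBF_YEARS = 30
DAYS_PER_YEAR = 365

# Expected failures per day = fleet size / MTBF expressed in days.
failures_per_day = SERVERS / (MTBF_YEARS * DAYS_PER_YEAR)
print(f"{failures_per_day:.2f} failures/day")  # ~0.91, i.e. about one a day
```

This is the whole argument for fault-tolerant software in one line: no per-machine reliability number survives multiplication by 10,000.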

The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.

image

Monitoring is how you know your estimates are correct.

Add Sufficient Monitoring/Status/Debugging Hooks
All our servers:
• Export HTML-based status pages for easy diagnosis
• Export a collection of key-value pairs via a standard interface
– monitoring systems periodically collect this from running servers
• RPC subsystem collects sample of all requests, all error requests, all
requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc.
• Support low-overhead online profiling
– cpu profiling
– memory profiling
– lock contention profiling
If your system is slow or misbehaving, can you figure out why?
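To make the slide's key-value idea concrete, here is a minimal sketch (not Google's actual interface; the endpoint name and status keys are illustrative) of a server exporting its status over HTTP so a monitoring system can scrape it periodically, using only Python's standard library:

```python
# Minimal sketch of a server exporting key-value status for monitoring.
# The "/varz" path and the STATUS keys are illustrative assumptions.

from http.server import BaseHTTPRequestHandler, HTTPServer

# In a real server these would be live counters updated by the RPC layer.
STATUS = {
    "requests_total": 12345,
    "errors_total": 7,
    "uptime_seconds": 86400,
}

def render_varz(status):
    """Render status as one 'key value' pair per line, sorted by key."""
    return "\n".join(f"{k} {v}" for k, v in sorted(status.items()))

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = render_varz(STATUS).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve, uncomment:
# HTTPServer(("localhost", 8080), StatusHandler).serve_forever()
```

The point is less the code than the convention: when every server exports the same simple interface, one monitoring system can watch the whole fleet.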

Many people have quoted the idea “you can’t manage what you don’t measure.”  But a more advanced concept that Google discusses is “If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!”

Know Your Basic Building Blocks
Core language libraries, basic data structures,
protocol buffers, GFS, BigTable,
indexing systems, MySQL, MapReduce, …
Not just their interfaces, but understand their
implementations (at least at a high level)
If you don’t know what’s going on, you can’t do
decent back-of-the-envelope calculations!

These ideas come from a software architect, but they apply just as much to data center design.  And the advantage Google has is that all of its IT and development staff think this way.

image

And here is another secret to great design: say no to features.  But what the data center design industry wants is for you to say yes to everything, because it makes the data center build more expensive, increasing their profits.

image

So what is the big design problem Google is working on?

image

Jeff Dean did a great job of putting a lot of good ideas in his presentation, and it was nice Google let him present some secrets we could all learn from.


CA about to launch EcoSoftware for Green IT

CNET news reports CA will be releasing an EcoSoftware solution.

CA jumps into eco-software market

by Larry Dignan

CA next week will unveil an integrated sustainability suite designed to track carbon emissions, environmental assessments, metering, and compliance to policies in one dashboard.

CA calls the suite EcoSoftware and will launch it Monday, according to Christopher Thomas, vice president of energy and sustainability. I ran into Thomas at the Gartner IT Symposium, where the carbon-monitoring software caught my eye.

There are other efforts designed to track carbon emissions. For instance, Hara and SAP have various applications and others use metering to measure sustainability efforts.

Read more of "CA jumps into eco software market; Plans to launch carbon tracking suite" at ZDNet's Between the Lines.

I have written in the past that it was natural for management tool vendors (Tivoli, OpenView, and CA) to add Green IT management, so this is no surprise.

We’ll get more details next week as the launch is scheduled for Oct 26.


Watching a person read my blog – Gartner employee

I use TypePad for my blog, and go through the site statistics to see what people are reading, checking out what Google search terms they use to find my entries.  This gives me an idea of what people are searching for and who my readers are.  The following, I am pretty sure, is a Gartner employee.

This morning at 9:30 am PST I wrote a blog post on Gartner’s recommendation for Pattern-based strategy.  At 2:45 and 2:48 pm I got the following hits.

image

The first was a Google search for “gartner advanced analytics.”  My one entry, written five hours earlier, is result #6, beating out NetworkWorld.

image

The second was a Google search for “reshaping the data center gartner”.  I have the #1 search result (behind the Google News box), beating Gartner’s blog, CIO.com, and news.cnet.com.

image

I was visiting a friend who works at Google yesterday and we chatted briefly how well my blog works with Google search, but I swear I have no insider information.

All I know is keep on writing, and keep on looking at my results.

Thanks for reading my blog.
