Robotics in the data center

DatacenterKnowledge has a guest post by an MTM Technologies consultant on robotics in the data center.

Robotics in the Data Center

As the reliance on the data center continues to grow, full software and hardware robotics automation is no longer a question of if, but a matter of when, technologists predict. Robotics organizations, like Chicago-based DevLinks LTD are already having conversations and creating initial designs for data center robotics automation.

About 4 years ago I started playing around with the idea of robotics in the data center, and 2 years ago I spent a good 6 months diving into the subject.  Some of the design concepts I figured out would not be intuitive for someone who doesn't understand the way data centers are built and the economics behind them.

One example of an issue is the DCK article's point about the ability to go up in height.

Grow vertically instead of just horizontally. Robotics allows the data center to be extremely efficient with space. After all, robotics will allow us to reach higher and go much further than we’ve been ever able to go. The ability to scale upwards allows data centers to create new designs utilizing floor space much more efficiently.

Anyone who has built and operated data centers knows a good rule is to have 100-150 watts/sq ft of IT white space.  Any denser increases costs and increases the probability of stranding power.  Going higher is also going to cause more heat problems, as the top-of-rack equipment could be 2-3 degrees warmer.
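To put that rule of thumb in numbers, here is a quick back-of-the-envelope sketch (the 1 MW IT load is an assumed figure just for illustration):

    # Back-of-the-envelope sizing using the 100-150 watts/sq ft rule of thumb.
    # The 1 MW IT load is an assumed figure for illustration.
    it_load_watts = 1_000_000              # 1 MW of IT load
    density_low, density_high = 100, 150   # watts per sq ft of IT white space

    space_at_low = it_load_watts / density_low     # 10,000 sq ft
    space_at_high = it_load_watts / density_high   # ~6,667 sq ft

    print(f"White space at 100 W/sq ft: {space_at_low:,.0f} sq ft")
    print(f"White space at 150 W/sq ft: {space_at_high:,.0f} sq ft")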

It is nice to see the robotics-in-the-data-center idea getting more attention.  When I was talking about the idea two years ago people thought I was really out there. :-)  I was going to write more on the topic, but only threw up two posts.  /gdcblog/category/robotics.  Well, this is the third.

Microsoft's Xbox One announces 300k servers, trouble for Sony and Nintendo

The TV console wars are between Microsoft, Sony and Nintendo.  Sony had a serious outage that brought down its service.

The PlayStation Network outage was the result of an "external intrusion" on Sony's PlayStation Network and Qriocity services, in which personal details from approximately 77 million accounts were stolen and prevented users of PlayStation 3 and PlayStation Portable consoles from playing online through the service.[1][2][3][4] The attack occurred between April 17 and April 19, 2011,[1] forcing Sony to turn off the PlayStation Network on April 20. On May 4 Sony confirmed that personally identifiable information from each of the 77 million accounts appeared to have been stolen.[5] The outage lasted 24 days.

I would guess anyone who has worked in Sony's online services group has a bad taste in their mouth, and it is hard for them to get more resources.

Xbox One launched this past week and Xbox Live is a big part of the services.

DatacenterKnowledge mentions that Microsoft will have 300,000 servers as part of Xbox One.  Running Google and Bing searches didn't turn up a Microsoft transcript of the event as the source of the information.

When we launched Xbox Live in 2002, it was powered by 500 servers. With the advent of the 360, that number had grown to over 3,000. Today, 15,000 servers power the modern Xbox Live experience. But this year, we will have more than 300,000 servers for Xbox One, more than the entire world's computing power in 1999. (Cheers, applause.)

This matches the DCK article.

“When we launched Xbox Live in 2002, it was powered by 500 servers,” Microsoft’s Marc Whitten said in introducing the new platform. “With the advent of the 360, that had grown to over 3,000. Today, 15,000 servers power the modern Xbox Live experience. But this year, we will have more than 300,000 servers for Xbox One.”

Curious, I wanted to see what was actually said, so I found the Xbox One launch event on YouTube; the 23:13 mark is where the Xbox server reference is made.  Thanks to the YouTube transcript, here is the text.

23:13
when we launch xbox live in two thousand two it was powered by five hundred
23:16
servers
23:17
with the advent of the three sixty that number had grown to over three thousand
23:22
today
23:22
fifteen thousand servers power the modern xbox live experience
23:27
but this year
23:29
we'll have more than three hundred thousand servers for xbox one
23:33
more than the entire world computing power in nineteen ninety nine
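
Just for perspective, here is a quick sketch of the growth those numbers imply (the 2002 and 2013 server counts come from the quote; the compound-growth math is only illustrative, since the 300,000 figure is an announced target rather than deployed servers):

    # Growth implied by the Xbox Live server counts quoted above.
    servers_2002 = 500
    servers_2013 = 300_000   # announced for Xbox One

    growth_multiple = servers_2013 / servers_2002             # 600x
    years = 2013 - 2002
    cagr = (servers_2013 / servers_2002) ** (1 / years) - 1   # ~79% per year

    print(f"{growth_multiple:.0f}x growth over {years} years, ~{cagr:.0%} per year compounded")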
 


Part of what I do for some clients is provide research services, and it is important to get to the original source of information and show where the public disclosures were made.  Thanks to YouTube and other online services, it is so much easier to get to the source of information, which is transforming how news is reported and how analysis can be done.

People just don't get it, Green is important, employees (the talent) care

I read this post by IDG's James Niccolai.  I know James, so I was curious what he wrote on the green topic at the Uptime Symposium.

Datacentres show signs of 'green fatigue'

Success stories from cutting-edge firms such as Google and Microsoft are causing a backlash at less capable data centers

A new survey from the Uptime Institute suggests fatigue is setting in when it comes to making datacentres greener, and it may be partly due to overachievers like Google and Microsoft.

James continues with input from Matt Stansberry.

"A lot of these green initiatives, like raising server inlet temperatures and installing variable-speed fans, are seen as somewhat risky, and they're not something you do unless you have a bunch of engineers on staff," he said.

But there may be other factors at work. Stansberry suspects that managers at smaller datacentres are simply fed up hearing about success stories from bleeding-edge technology companies such as Google, and their survey responses may reflect frustration at their inability to keep up.

"I don't really think that half the datacentres in the US aren't focused on energy efficiency, I think they're just sick of hearing about it," he said. "You've got all these big companies with brilliant engineers and scads of money, and then there's some guy with a bunch of old hardware sitting around thinking, What the hell am I supposed to do?"

I just spent a week in NC talking to lots of folks in the data center industry, and I think everyone I chatted with for more than 30 seconds had a concern for the environment.  So, what is different between the people I was chatting with and the people filling out an Uptime survey? Well, they know I care about the environment, so that biases our conversations.  I wonder if people think the Uptime Institute cares about the environment?

What I did find, for a bit of perspective, was a post by Mike Manos from 2008 reporting that Uptime's Ken Brill had branded Microsoft and Google as the enemy of traditional data center operators.

I was personally greatly disappointed with the news coming out of last week that the Uptime Institute had branded Microsoft and Google as the enemy to traditional data center operators.  To be truthful, I did not give the reports much credit especially given our long and successful relationship with that organization.  However, when our representatives to the event returned and corroborated the story, I have to admit that I felt more than  a bit let down.

As reported elsewhere, there are some discrepancies in how our mission was portrayed versus the reality of our position.

...

The comments that Microsoft and Google are the biggest threat to the IT industry and that Microsoft is “making the industry look bad by putting our facilities in areas that would bring the PUE numbers down” are very interesting.

Here is a Harvard Business Review blog post on how sustainability fits into the battle for talent.

Sustainability Matters in the Battle for Talent

Employees at semiconductor-chip-maker Intel recently devised a new chemistry process that reduced chemical waste by 900,000 gallons, saving $45 million annually. Another team developed a plan to reuse and optimize networking systems in offices, which cut energy costs by $22 million.

The projects produced financial and environmental benefits, of course. But just as valuable is the company's ability to energize and empower front-line employees. New data shows that sustainability is an increasingly important factor in attracting and managing talent.

This point reminds me of one piece of data I have on the hunt for talent.  When Olivier Sanche was recruiting, he once asked me to guess how many people had read my post about his going to Apple and caring about the environment. I don't know, 25%?  "Everyone.  Every person says they read your post, and they expect me to save the polar bears, and they want to work for a person who has a passion for the environment."

Look at who some of the most vocal green data center companies are: Google, Apple, Facebook, and Microsoft.  You can argue about which is the best of these 4, but all 4 are investing in greener initiatives on a regular basis.

Green fatigue could be caused by misguided efforts more than by whether green is important.

If you were trying to lose weight and get in better shape, would you give up just because one diet and exercise plan didn't work?  Sure, you're frustrated, but that doesn't mean you should give up.

Part of the argument is that it takes a lot of money to do these green things, but even Google has shown how it has used very little money to improve its operations. Here is Google's paper on its improvements in a POP room.

Introduction

Every year, Google saves millions of dollars and avoids emitting tens of thousands of tons of carbon dioxide thanks to our data center sustainability efforts. In fact, our facilities use half the energy of a typical data center. This case study is intended to show you how you can apply some of the cost-saving measures we employ at Google to your own data centers and networking rooms. At Google, we run many large proprietary data centers, but we also maintain several smaller networking rooms, called POPs or “Points of Presence”. POPs are similar to millions of small and medium-sized data centers around the world. This case study describes the retrofit of one of these smaller rooms, describing best practices and simple changes that you can make to save thousands of dollars each year. For this retrofit, Google spent a total of $25,000 to optimize this room’s airflow and reduce air conditioner use. A $25,000 investment in plastic curtains, air return extensions, and a new air conditioner controller returned a savings of $67,000/year. This retrofit was performed without any operational downtime.
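As a quick sketch of the economics described in that quote (both numbers come straight from the case study):

    # Simple payback on the POP retrofit described in Google's case study.
    investment = 25_000       # plastic curtains, air return extensions, AC controller
    annual_savings = 67_000   # reported savings per year

    payback_years = investment / annual_savings   # ~0.37 years
    print(f"Simple payback: {payback_years:.2f} years (~{payback_years * 12:.1f} months)")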

Lack of Redundancy in Bridge Design causes I-5 Outage

It is amazing how there can be single points of failure in data centers even though they were sold as highly available designs.  Some make the mistake of assuming that just because something hasn't failed in the past and a lot of money was paid for it, failure is unlikely.

My dad was a civil engineer with Caltrans (California's Department of Transportation) Bridge Division, which includes overpasses (CA has way more overpasses than bridges), so whenever I read about civil engineering it reminds me of a possible conversation with my dad.  Unfortunately, my dad died of colon cancer 19 years ago, so I have to imagine the conversations.

In the state of Washington, I-5 had an outage: a bridge collapsed when a truck's load hit the structure.

What is the cause of the bridge collapse, an outage of Interstate 5 between Seattle and Vancouver, BC?  One hit from a truck and it collapses?  Sounds like a Jenga design: knock out this one block and the whole thing falls.

Here is a view from Google Maps of what the bridge used to look like before the collapse.

[Google Maps view of the bridge before the collapse]

The WSJ has an answer to the outage.  A single point of failure.  The lack of redundancy in the design. 

"This is not the sign of deteriorating infrastructure, this is a sign of vulnerable infrastructure," said Abolhassan Astaneh-Asl, a civil-engineering professor at the University of California, Berkeley.

"This original design in those days was fine," he said of bridges lacking redundancy, "but today we should invest in getting these…out of the system."

...

The bridge has what is known as a "fracture-critical" design, which means that if any part fails, the whole bridge could fail, said Mr. Astaneh-Asl. "A fracture critical bridge is like a chain," he said. "Any link in this chain you cut, it's going to fail."
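The chain analogy maps directly onto basic reliability math, which is the same math that applies to data center designs with single points of failure. A minimal sketch, assuming a made-up 0.99 survival probability per component:

    # Series (fracture-critical) vs. redundant (parallel) designs.
    # The 0.99 per-component reliability is an assumed number for illustration.
    p = 0.99      # probability one component survives the period
    n = 10        # components in the chain

    series = p ** n                 # every link must hold: ~0.904
    redundant = 1 - (1 - p) ** 2    # fails only if both of two paths fail: ~0.9999

    print(f"Series of {n} components: {series:.3f}")
    print(f"Two redundant paths:      {redundant:.4f}")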

Data Center Capacity Infrastructure Pattern

The concept of software patterns is well established.  

In software engineering, a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations. Patterns are formalized best practices that the programmer must implement themselves in the application.

Over 9 years ago I tried to figure out infrastructure patterns in the same way that software patterns are used.  About 2-3 years ago, I finally understood how to develop infrastructure patterns, and it has taken me the additional 2-3 years to test some of the ideas and get to the point of writing them up.

The following is going to be a riff of ideas, and I'll most likely clean it up with the help of some of my friends who are good at writing up patterns, so read the following as a rough draft.

I was talking to a software guy who now works in a data center deploying some of the more complex IT equipment.  His background is software so he gets patterns. We had a brief conversation the other day and I explained the following pattern of site design and capacity.  One of the most important things in defining a pattern is to identify what problem you want to solve.

The problem I am going to discuss is how to add data center capacity in a region like a city.

The typical method is to identify the current need.  Let's say there is a need for 100kW in a facility in a city.  The team that acquires capacity knows how difficult it can be to add space or add to an existing cage, so they decide to quadruple the requirement and look for 400kW.  To start they'll use 25% of the capacity and grow into it over a 10-year period.  They set up the lease to have one fee for reserving the capacity and another set of fees for actual use.
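A rough sketch of what that lease looks like in the early years (the fee rates are made-up numbers for illustration):

    # "Reserve 4x and grow into it" lease model.
    # All rates are assumed numbers for illustration only.
    reserved_kw = 400
    reservation_fee_per_kw_month = 50   # fee for holding the capacity
    usage_fee_per_kw_month = 100        # fee for capacity actually drawn
    used_kw = 100                       # year-1 utilization (25% of the reservation)

    monthly_cost = (reserved_kw * reservation_fee_per_kw_month
                    + used_kw * usage_fee_per_kw_month)
    print(f"Month-1 cost: ${monthly_cost:,} while drawing only {used_kw} kW")  # $30,000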

The flaw in this method is the assumption that the space needs to be contiguous in one cage area, treating contiguous space as a requirement.  That is logical from a real estate perspective.

Proposed method:  Pick a unit of power that is the most cost effective in a facility given the power infrastructure.  Let's say 140kW, enough to handle the 100kW requirement with 40% headroom.  The fear is the business could rapidly need more space.  The key to picking this first space is that it should have high connectivity to other spaces in the building (not necessarily adjacent) and to other buildings that can support the growth of the company. As the business outgrows the original 140kW, the data center group has already identified other candidate spaces to add for growth.  The strategy is to start with two spaces that are on different power, cooling, and network infrastructure, then continue to add more in a mesh of 3-5 sites.  The trade-off of adding smaller units of expansion that can be fully loaded and optimized is that it forces an isolation of compute that can be useful.
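Expressed as a simple capacity plan, the proposed method looks something like this (the 140kW unit size comes from the example above; the demand numbers are assumptions):

    # Mesh-of-units plan: add 140 kW units as demand grows, each on
    # independent power, cooling, and network infrastructure.
    unit_kw = 140
    demand_kw = [100, 180, 320, 450]   # assumed demand over successive years

    for year, demand in enumerate(demand_kw, start=1):
        units = -(-demand // unit_kw)  # ceiling division: units required
        print(f"Year {year}: {demand} kW demand -> {units} unit(s), {units * unit_kw} kW deployed")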

For example, by the time you get to the 4th unit it is highly likely the 1st unit is in need of a hardware refresh across most of the IT gear.  As you power up the 4th unit, you can be working on decommissioning the 1st site, completely replacing the gear to support future growth.  If you had one contiguous space, it is highly likely the 1st deployments are so intertwined with the next 3 years of deployments that the upgrade process is extremely complex.  If each unit of expansion is meant to be isolated in a mesh, then the dependencies are reduced and units are easier to take offline.
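A minimal sketch of that rolling refresh, assuming one new unit per year and a three-year hardware life (both assumptions for illustration):

    # Rolling refresh across isolated units: when a new unit comes online,
    # the unit deployed hardware_life years earlier is decommissioned and refreshed.
    hardware_life = 3
    units = ["unit-1", "unit-2", "unit-3", "unit-4"]

    for year, unit in enumerate(units, start=1):
        idx = year - hardware_life - 1
        note = f", refresh {units[idx]}" if idx >= 0 else ""
        print(f"Year {year}: bring {unit} online{note}")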

Issues: it is overly simplistic to treat data center space as if it were office space that needs to be in one building on adjacent floors.  Can you imagine if the corporate real estate group put office groups on the 3rd, 8th, and 15th floors of a building, and another team in another building 1/2 mile away?  But guess what: with the right network infrastructure, moving bits from floor to floor, or to another building, is not an issue.

Examples:  When you look at Google, Facebook, and Microsoft's data centers, they build additional buildings to add capacity to a site.  They did not build a building 4 times bigger than what they needed and grow into it over 10 years.  Modular data centers by Dell, HP, and Compass Datacenters allow those who feel they need to have buildings to use this same approach.  Once you jump off the top-of-rack switch, it makes little difference whether you are going 5 ft, 500 ft, 5,000 ft, or 50,000 ft.