Do you have a bug from hell running in your data center?

Before working in data centers, I worked on operating systems at Apple and Microsoft. Creating software and working on data centers are kind of a yin and yang: soft and hard, white and black, and so on.

In Chinese philosophy, the concept of yin-yang (simplified Chinese: 阴阳; traditional Chinese: 陰陽; pinyin: yīnyáng), which is often called "yin and yang", is used to describe how seemingly opposite or contrary forces are interconnected and interdependent in the natural world; and, how they give rise to each other as they interrelate to one another. Many natural dualities (such as male and female, light and dark, high and low, hot and cold, water and fire, life and death, and so on) are thought of as physical manifestations of the yin-yang concept.

I never thought about it until now, but the hard part of IT is that it is yin and yang: software and hardware, bits and physical buildings, web services and real physical infrastructure, a SW engineer and a mechanical engineer.

Getting everything in a data center to work just right can be frustrating, as things sometimes don't work exactly the way they are supposed to. In all that physical infrastructure there are software bugs from hell: bugs so nasty and nerve-wracking they will make you want to pull your hair out. Some of the nastiest of these bugs exist at the transition from light to dark, like yin and yang. Here is a description of bugs from hell.

BugFromHell is any bug where several hours or more of time is spent by a veteran developer attempting to track-down (and fix) the cause of a software bug. By definition, any bug that takes this long to find is almost always the result of a side-effect of the problematic code (otherwise, the problem would be readily visible via typical debug tools--e.g., stack trace, stepping through code in debug mode, etc). A BugFromHell is very elusive and typically cannot be isolated or consistently reproduced.

  • Hours? Nah, a true BFH is one that takes weeks to find. (Especially in embedded systems work, when "it's a hardware problem" is always a possibility).

The effects of a BugFromHell typically appear anywhere except near the problematic code. Such a bug will write to a random part of memory, flip bits that aren't detected for a long period of running time, or appear to happen randomly without appearing to have been triggered by anything; or, worse, appear to be affected by the act of observing it (a HeisenBug).

In the examples the author uses, you can see that many of these bugs from hell exist at the interface between software and hardware.

Examples:

  • overwriting part of the stack frame
  • writing to a memory location that has been moved or deleted (and is now occupied by a different object)
  • using an uninitialized variable that ultimately leads to writing to a random memory location
  • an unforeseen interaction between two threads or processes that only has a very small chance of occurring
  • thread interaction that won't happen running on a single CPU box, but which manifests on multiple CPUs
  • assumptions made by developers of one webbrowser that aren't made by any other. (You'll always have a <title> tag when setting the charset.)
  • Hardware drivers that aren't sufficiently paranoid / robust.
  • JMPing into an unprotected NULL, or into some other executable gibberish.
  • returning from a function with an unbalanced stack (primarily when embedding assembly code, for embedded systems).
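The thread-interaction items in the list above are the hardest to picture, so here is a minimal sketch in Python. It is not from the quoted description. It simulates two workers doing a read-modify-write on a shared counter, with the interleaving written out by hand so the result is repeatable. The names `Counter`, `run_serial` and `run_interleaved` are made up for illustration.

```python
class Counter:
    """A shared value updated with a non-atomic read-modify-write."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def write(self, new_value):
        self.value = new_value


def run_serial(counter):
    # Worker A finishes its whole update before worker B starts.
    a = counter.read()
    counter.write(a + 1)
    b = counter.read()
    counter.write(b + 1)
    return counter.value


def run_interleaved(counter):
    # Both workers read before either one writes, so one update is lost.
    a = counter.read()
    b = counter.read()
    counter.write(a + 1)
    counter.write(b + 1)
    return counter.value


serial_result = run_serial(Counter())
interleaved_result = run_interleaved(Counter())
print(serial_result, interleaved_result)
```

In a real system the scheduler picks the interleaving, not the programmer, which is why the bad ordering may show up only once in millions of runs, or only on a multi-CPU box.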

Bugs from hell are running in every data center and are so frustrating.  

Why did I write this post? Because my SW dev lead has spent three weeks working full time to fix a bug from hell. Ouch. Three weeks of productivity unexpectedly sapped by a bug so nasty that it was elusive yet extremely damaging.

Google's Emerging Market plans leaked, WSJ covers wireless efforts

Back in Sept 2012 Google announced its first data center build in LATAM. Making the jump from NAP of the Americas to South America and being in co-location sites can only work for a limited audience. At some point you'll need MWs of data center space.

I have long said to my clients that there is a worldwide race to provide sub 100ms latency to everyone in the world. Google is a player, and so is Equinix. Digital Realty Trust is building out wholesale space. Carriers are building relationships and capabilities to span the world. Netflix is expanding in emerging markets, which drives demand for AWS globally as well.
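To put the sub 100ms target in physical terms, here is a rough sketch. It assumes light in optical fiber covers about 200 km per millisecond (roughly two thirds of its speed in a vacuum) and treats the 100ms as a round-trip budget. The function name and the `overhead_ms` parameter are made up for illustration, and real fiber routes are longer than straight lines.

```python
# Assumption: light in optical fiber travels roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def max_one_way_km(rtt_budget_ms, overhead_ms=0.0):
    """Farthest a server can be, in fiber km, for a given round-trip budget.

    overhead_ms is a hypothetical allowance for routing, queuing and
    server processing time that is not spent on propagation.
    """
    usable_ms = rtt_budget_ms - overhead_ms
    if usable_ms <= 0:
        return 0.0
    # Half of the usable time goes out, half comes back.
    return usable_ms / 2 * FIBER_KM_PER_MS


best_case_km = max_one_way_km(100)
with_overhead_km = max_one_way_km(100, overhead_ms=40)
print(best_case_km, with_overhead_km)
```

Even in the best case the server has to sit within about 10,000 km of fiber from the user, and any allowance for overhead shrinks that quickly, which is why the race means building capacity in-region.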

The WSJ covers Google's efforts in Africa, the Middle East, and Southeast Asia.

Google to Fund, Develop Wireless Networks in Emerging Markets

Google Inc. is deep into a multipronged effort to build and help run wireless networks in emerging markets as part of a plan to connect a billion or more new people to the Internet.

These wireless networks would serve areas such as sub-Saharan Africa and Southeast Asia to dwellers outside of major cities where wired Internet connections aren't available, said people familiar with the strategy.

The networks also could be used to improve Internet speeds in urban centers, these people said.

Google plans to team up with local telecommunications firms and equipment providers in the emerging markets to develop the networks, as well as create business models to support them, these people said. It is unclear whether Google already has lined up such deals or alliances.

One of the things I have been watching for is when servers will show up in cell tower installations to improve the performance and latency of mobile devices. With Google's acquisition of Motorola, it can create a wireless data center solution. And there is even speculation that Google will launch an airborne wireless fleet.

As part of the plan, Google has been working on building an ecosystem of new microprocessors and low-cost smartphones powered by its Android mobile operating system to connect to the wireless networks, these people said. And the Internet search giant has worked on making special balloons or blimps, known as high-altitude platforms, to transmit signals to an area of hundreds of square miles, though such a network would involve frequencies other than the TV broadcast ones.

Google has also considered helping to create a satellite-based network, some of these people said.

Some people may think this is news, but there were discussions as far back as 2007 that Google was looking at wireless networks.

Sometimes the rumours are both outrageous and true. Google is experimenting with new ways of bringing broadband connections to consumers, by blanketing parts of Silicon Valley with Wi-Fi networks. It is planning to enter an auction for valuable radio spectrum in America, and thinking of radically new business models to make money from wireless data and voice networks, perhaps a free service supported by ads.

Who has the data on the social media wars? uh, search engines

Social Media is the big battlefield.  News.com has an article on Twitter's strength of hashtags vs. Facebook.

#Hashtags: Facebook's missing link to pop culture

The # symbol has become the key to connecting to people and events you care about on social media. It's also an obvious hole for Facebook.

Is this insight really meaningful? Can you make a business decision on it? Is adding #hashtags the answer?

Who knows better what to do in social media?  A bunch of analysts and media reporters? Or?

Who has the data? Ohhhhhh, Google does. One thing I have noticed running this blog site is the number of robots and search engine crawlers that hit my site.

Google and other search engines (Baidu, Bing, Yandex) all have the raw data to understand the social media wars: what works and what doesn't.

The battles for mobile and social are like wars where data is key to defining strategies. But in any battle the winners are not necessarily those who have the most information. If that were true, the CIA with all its analysts and computer systems should have been able to win every war. Too much data can create a problem of analysis.

The problem with data is it shows the past, not necessarily the future.  Yet, some people will stand on the piles of data and use it to justify their position of analysis.

Sometimes the winner in the war is the one who takes a different strategy that the data doesn't support.

Ender's Game, a science fiction classic, is finally coming to the screen at the end of the year. Did Ender win because he had more data, or because he saw things from a different perspective than others?

Robotics in the data center

DatacenterKnowledge has a guest post by an MTM Technologies consultant on robotics in the data center.

Robotics in the Data Center

As the reliance on the data center continues to grow, full software and hardware robotics automation is no longer a question of if, but a matter of when, technologists predict. Robotics organizations, like Chicago-based DevLinks LTD are already having conversations and creating initial designs for data center robotics automation.

About 4 years ago I started playing around with the idea of robotics in the data center, and 2 years ago I spent a good 6 months diving into the subject. Some of the design concepts I figured out would not be intuitive for someone who doesn't understand the way data centers are built and the economics.

One example of an issue is mentioned in the DCK article on the ability to go up in height.

Grow vertically instead of just horizontally. Robotics allows the data center to be extremely efficient with space. After all, robotics will allow us to reach higher and go much further than we’ve been ever able to go. The ability to scale upwards allows data centers to create new designs utilizing floor space much more efficiently.

Anyone who has built and operated data centers knows a good rule is to have 100-150 watts/sq ft of IT white space. Going any denser increases costs and increases the probability of stranding power. Going higher is also going to cause more heat problems, as the top-of-rack equipment could be 2-3 degrees warmer.
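Here is a small sketch of the arithmetic behind that rule of thumb. The 100-150 watts/sq ft range comes from the paragraph above. The 25 sq ft of floor per rack (footprint plus its share of aisle space) and the rack wattages are hypothetical numbers picked only to show the effect of stacking more equipment on the same floor area.

```python
def white_space_sqft(it_load_kw, watts_per_sqft):
    """Floor area needed to host an IT load at a given power density."""
    return it_load_kw * 1000 / watts_per_sqft


def density_watts_per_sqft(rack_kw, sqft_per_rack):
    """Power density produced by one rack and the floor area allotted to it."""
    return rack_kw * 1000 / sqft_per_rack


# 1 MW of IT load at the low end of the rule of thumb.
area_low = white_space_sqft(1000, 100)

# Hypothetical rack: 3 kW on 25 sq ft, then twice the equipment
# stacked on the same footprint by going higher.
normal_density = density_watts_per_sqft(3.0, 25)
stacked_density = density_watts_per_sqft(6.0, 25)
print(area_low, normal_density, stacked_density)
```

With these assumed numbers, doubling the equipment height takes the floor from 120 to 240 watts/sq ft, well past the 100-150 range, so the space saved has to be paid for in power and cooling.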

It is nice to see the idea of robotics in the data center getting more attention. When I was talking about the idea two years ago people thought I was really out there. :-)  I was going to write more on the topic, but only put up two posts: /gdcblog/category/robotics. Well, this is the third.

Microsoft's Xbox One announces 300k servers, trouble for Sony and Nintendo

The TV console wars are between Microsoft, Sony and Nintendo.  Sony had a serious outage that brought down its service.

The PlayStation Network outage was the result of an "external intrusion" on Sony's PlayStation Network and Qriocity services, in which personal details from approximately 77 million accounts were stolen and prevented users of PlayStation 3 and PlayStation Portable consoles from playing online through the service.[1][2][3][4] The attack occurred between April 17 and April 19, 2011,[1] forcing Sony to turn off the PlayStation Network on April 20. On May 4 Sony confirmed that personally identifiable information from each of the 77 million accounts appeared to have been stolen.[5] The outage lasted 24 days

I would guess anyone who has worked in Sony's online services group has a bad taste in their mouth and it is hard to get more resources.  

Xbox One launched this past week and Xbox Live is a big part of the services.

DatacenterKnowledge mentions that Microsoft will have 300,000 servers as part of Xbox One. A Google search (a Bing search didn't) turned up the source of the information, a Microsoft transcript of the event.

When we launched Xbox Live in 2002, it was powered by 500 servers. With the advent of the 360, that number had grown to over 3,000. Today, 15,000 servers power the modern Xbox Live experience. But this year, we will have more than 300,000 servers for Xbox One, more than the entire world's computing power in 1999. (Cheers, applause.)

This matches the DCK article.

“When we launched Xbox Live in 2002, it was powered by 500 servers,” Microsoft’s Marc Whitten said in introducing the new platform. “With the advent of the 360, that had grown to over 3,000. Today, 15,000 servers power the modern Xbox Live experience. But this year, we will have more than 300,000 servers for Xbox One.”
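To see how big the jump is, here is a quick sketch of the growth multiples using only the four server counts quoted above. The milestone labels are my own shorthand, and the quoted figures are "over" or "more than" values, so the multiples are approximate.

```python
# Server counts as quoted in the transcript and the DCK article.
milestones = [
    ("Xbox Live launch, 2002", 500),
    ("Xbox 360", 3_000),
    ("Xbox Live today", 15_000),
    ("Xbox One", 300_000),
]

# Growth multiple from each milestone to the next.
multiples = [
    later_count / earlier_count
    for (_, earlier_count), (_, later_count) in zip(milestones, milestones[1:])
]
overall = milestones[-1][1] / milestones[0][1]

for (label, _), multiple in zip(milestones[1:], multiples):
    print(f"{label}: {multiple:.0f}x the previous milestone")
print(f"Overall: {overall:.0f}x since launch")
```

The earlier steps were 6x and 5x; the Xbox One step is 20x in a single year.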

Curious, I wanted to see what was actually said, so I found the Xbox One launch event on YouTube; the Xbox server reference is made at the 23:13 mark. Thanks to the YouTube transcript, here is the text.

23:13 when we launch xbox live in two thousand two it was powered by five hundred
23:16 servers
23:17 with the advent of the three sixty that number had grown to over three thousand
23:22 today
23:22 fifteen thousand servers power the modern xbox live experience
23:27 but this year
23:29 we'll have more than three hundred thousand servers for xbox one
23:33 more than the entire world computing power in nineteen ninety nine
 


Part of what I do for some clients is provide research services, and it is important to get to the original source of information and show where the public disclosures were. Thanks to YouTube and other online services, it is so much easier to get to the source of information, which is transforming how news is reported and how analysis can be done.