How Facebook Ships Code, hints of how their data centers work

My wife and I just watched The Social Network on DVD.  We have a 9-year-old and a 6-year-old, so going out to the movies together is rare.

The Social Network Trailer

My wife worked in sales for companies like IDG and Ziff Davis, with clients like Intel, Palm, and Microsoft.  She has seen the SW developers, but was never really exposed to their world.  Watching The Social Network was entertaining, and it is a Hollywood spin on the SW culture.  She was amazed at how the SW developers were portrayed and at their focus on writing code.  The bit of irony is that, having worked on the Apple Mac OS team and the Microsoft Windows team, I could recognize behaviors that reminded me of the days when I was much younger and would work in the same mode as Facebook was portrayed.  The reality, not the Hollywood version.

So what is it like in Facebook Development?  Here is a post on How Facebook Ships Code.

How Facebook Ships Code

January 17, 2011 — yeeguy

I’m fascinated by the way Facebook operates.  It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried).  These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.

Seems like others are also interested in Facebook…   The company’s developer-driven culture is coming under greater public scrutiny and other companies are grappling with if/how to implement developer-driven culture.   The company is pretty secretive about its internal processes, though.  Facebook’s Engineering team releases public Notes on new features and some internal systems, but these are mostly “what” kinds of articles, not “how”…  So it’s not easy for outsiders to see how Facebook is able to innovate and optimize their service so much more effectively than other companies.  In my own attempt as an outsider to understand more about how Facebook operates, I assembled these observations over a period of months.  Out of respect for the privacy of my sources, I’ve removed all names and mention of specific features/products.  And I’ve also waited for over six months to publish these notes, so they’re surely a bit out-of-date.   I hope that releasing these notes will help shed some light on how Facebook has managed to push decision-making “down” in its organization without descending into chaos…  It’s hard to argue with Facebook’s results or the coherence of Facebook’s product offerings.  I think and hope that many consumer internet companies can learn from Facebook’s example.

Over the last 2 years I have had friends interview at Facebook.  Most turned the company down: they were being recruited for senior engineering manager positions, and they couldn’t see how they could do their job and be successful.

The post has lots of information; here are the parts that give you an idea of how Facebook thinks about its SW, which influences its hardware and data centers.

engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is.  can be hard to get engineers excited about working on front-end projects and user interfaces.  this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.”  At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.

Note that the “consumer business” reference above can be read as implying Apple.

Additional information that backs up a focus on infrastructure.

  • as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago.  Nearly doubling staff in under a year!
  • the two largest teams are Engineering and Ops, with roughly 400-500 team members each.  Between the two they make up about 50% of the company.

More details are explained here in a process for releases.

  • by default all code commits get packaged into weekly releases (tuesdays)
  • with extra effort, changes can go out same day
  • tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
  • engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
  • ops team runs code releases by gradually rolling code out
    • facebook has around 60,000 servers
    • there are 9 concentric levels for rolling out new code
    • [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
    • the smallest level is only 6 servers
    • e.g., a new tuesday release is rolled out to 6 servers (level 1); the ops team then observes those 6 servers and makes sure they are behaving correctly before rolling forward to the next level.
    • if a release is causing any issues (e.g., throwing errors, etc.) then push is halted.  the engineer who committed the offending changeset is paged to fix the problem.  and then the release starts over again at level 1.
    • so a release may go thru levels repeatedly:  1-2-3-fix. back to 1. 1-2-3-4-5-fix.  back to 1.  1-2-3-4-5-6-7-8-9.
  • ops team is really well-trained, well-respected, and very business-aware.  their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior.  E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
  • during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention.  not responding to ops team results in public shaming.
  • once code has rolled out to level 9 and is stable, then done with weekly push.
  • if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
  • getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired.  ”it’s a very high performance culture”.  people that aren’t productive or aren’t super talented really stick out.  Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”.  this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
  • [CORRECTION thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
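The level-by-level push with its halt-and-restart loop can be sketched in a few lines of Python.  This is a hypothetical illustration of the flow described above, not Facebook’s actual tooling; the level sizes and the health check are made up.

```python
# Hypothetical sketch of the staged rollout described above.
# Level sizes and the health check are illustrative only.

def staged_push(levels, healthy, max_attempts=10):
    """Roll a release out level by level.

    Any failing level halts the push; after the offending engineer
    "fixes" the problem, the release starts over at level 1.
    Returns the number of attempts it took to reach a stable release.
    """
    for attempt in range(1, max_attempts + 1):
        for level, server_count in enumerate(levels, start=1):
            if not healthy(attempt, level):
                break  # halt the push, page the engineer, restart at level 1
        else:
            return attempt  # all levels passed: weekly push is done
    raise RuntimeError("release never stabilized")

# Example run: the first attempt throws errors at level 3, so the push
# restarts; the second attempt walks all levels from the smallest tier up.
levels = [6, 60, 600, 6000, 6000, 6000, 6000, 6000, 6000]  # smallest tier: 6 servers
failures = {(1, 3)}  # (attempt, level) pairs that misbehave

def healthy(attempt, level):
    return (attempt, level) not in failures

result = staged_push(levels, healthy)
print(result)  # → 2 (stable on the second attempt)
```

The `for/else` idiom means the release only finishes when every level passes without a break, which matches the 1-2-3-fix, back-to-1 sequence in the post.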

Note that the 60,000 server count is not accurate based on my research; the real number is at least twice that, with another one-third of growth in the short term before the Prineville DC comes on line.

Would you want to be a senior executive hired for this environment?  Now you can see why a lot of my friends turned down jobs. 

Read more

Asia Data Center Alliance commits to Green Data Centers

Asia Pacific is one of the big growth markets.  Equinix and Digital Realty Trust are expanding in these markets with partnerships.  Those left out of these partnerships could try to compete individually or create their own partnerships, like the Asia Data Center Alliance (ADCA), which has a commitment to Green Data Centers.

This Data Center Alliance sets the precedence for ASEAN data center service providers to join forces & collaborate in enabling region's enterprises to stay competitive in the global economy. The other objective of ADCA is to develop Green Data Centers by applying modest energy saving technologies as the contribution for reducing the greenhouse effect and global warming.

ADCA founding members include:

  • 1-Net Singapore Pte Ltd (Singapore)
  • CMC Telecommunication Services Corp. (Vietnam)
  • The AIMS Asia Group Sdn Bhd (Malaysia)
  • T.C.C. Technology Co., Ltd (Thailand)

ADCA associate members include:

  • HKCOLO Limited (Hong Kong)

The combined space is 450,000 sq ft. 

The Asia Data Center Alliance (ADCA) has a combined space of more than 45,000 m2. We strive to be one of the biggest data…

It is too bad they follow the traditional approach of quoting space and not power. 
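As a quick sanity check on the space figures above, the quoted 45,000 m2 works out to roughly 484,000 sq ft, so the 450,000 sq ft summary is in the right ballpark.  The conversion factor here is the standard one, not from the source:

```python
# Convert the quoted ADCA floor space from square meters to square feet.
SQFT_PER_M2 = 10.7639  # standard square-meter to square-foot factor
combined_m2 = 45_000   # "more than 45,000 m2" per the ADCA quote
combined_sqft = combined_m2 * SQFT_PER_M2
print(f"{combined_sqft:,.0f} sq ft")  # roughly 484,000 sq ft
```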

Read more

New NSA “spy” data center adding up the staff

DefenseSystems.com discusses the new NSA “spy” data center.

Work commences on $1B NSA 'spy' center

Cyber intelligence data center reportedly will support the Comprehensive National Cybersecurity Initiative

The U.S. Army Corps of Engineers broke ground this week on a massive new National Security Agency cyber intelligence center in Utah. Located at Camp Williams, 25 miles south of Salt Lake City, the $1.2 billion facility — officially known as the Utah Data Center — will be responsible for collecting and aggregating incoming intelligence data.

According to USACE, the center will have 100,000 square feet of raised-floor data center space and more than 900,000 square feet of technical support and administrative space. Support facilities will include an electrical substation, a vehicle inspection facility and visitor control center, fuel storage, water storage, and a chiller plant. Camp Williams is a National Guard training site operated by the Utah National Guard.

The Salt Lake City paper, the Deseret News, also writes about the employment numbers.

Utah's $1.5 billion cyber-security center under way

Published: Thursday, Jan. 6, 2011 1:10 a.m. MST

By Steve Fidel, Deseret News

CAMP WILLIAMS — Thursday's groundbreaking for a $1.5 billion National Security Agency data center is being billed as important in the short term for construction jobs and important in the long term for Utah's reputation as a technology center.

"This will bring 5,000 to 10,000 new jobs during its construction and development phase," Sen. Orrin Hatch, R-Utah, said on Wednesday. "Once completed, it will support 100 to 200 permanent high-paid employees."

The U.S. Army Corps of Engineers and National Security Agency host a joint groundbreaking ceremony for the first Intelligence Community Comprehensive National Cyber-security Initiative (CNCI) Data Center Thursday, Jan. 6, 2011, at Camp Williams. Construction of the $1.2 billion Data Center is scheduled to be completed in October 2013.

Stuart Johnson, Deseret News

But something doesn’t add up.  100 to 200 staff to support 100,000 sq ft of NSA-type white space is plausible.  But 900,000 sq ft of technical support and administrative space means 4,500 – 9,000 sq ft per employee.

What data center do you know of that has 1/10 of its space dedicated to white space and 9/10 to support?

Something doesn’t add up in terms of what is in the NSA “spy” data center unless there are a lot more people there.
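The arithmetic is easy to check.  A quick back-of-the-envelope calculation using only the figures quoted above:

```python
# Back-of-the-envelope check of the Utah Data Center figures quoted above.
white_space = 100_000    # sq ft of raised-floor data center space
support_space = 900_000  # sq ft of technical support and administrative space
staff_low, staff_high = 100, 200  # permanent employees per Sen. Hatch

sqft_per_employee_high = support_space / staff_high  # 4,500 sq ft each
sqft_per_employee_low = support_space / staff_low    # 9,000 sq ft each
white_space_fraction = white_space / (white_space + support_space)

print(sqft_per_employee_high, sqft_per_employee_low, white_space_fraction)
# → 4500.0 9000.0 0.1
```

A typical office allocation is a few hundred square feet per person, an order of magnitude less, which is why the staffing numbers look off.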

Read more

Olivier Sanche Memorial Service Jan 28, 2011 in Los Gatos, CA

It’s been almost 2 months since Olivier Sanche’s voice went silent, and we’ll never again hear him speaking with passion about data centers and the environment. 

A Memorial Service will be held on Jan 28, 2011 at 4p in Los Gatos, CA.

Here are details.

Dear all,


Some of you might not be aware of this terrible news so it is with great sorrow that I must inform you that Olivier passed away on November 26, 2010 in Europe from a sudden heart attack.


The funeral took place on December 3rd in Pignan ( his hometown), France.
There will be a memorial in his honor on Friday, January 28, 2011 at St. Mary's Catholic Church, 219 Bean Avenue in Los Gatos ( California) at 4PM.
Here is the link for directions.

The memorial will be held after school hours so there should be some parking available in the church's parking lot, otherwise the town of Los Gatos has several free public parking lots (along University Street ) as well as street parking.

Sincerely,
Karine Sanche

Please feel free to send me stories you have about Olivier as I’ll be helping to pull together a perspective on his awesome past. 

-Dave Ohara

dave@greenm.com

Read more

Increasing Energy Efficient Server Competition, ARM efforts ramps up with Virtualization Support

James Hamilton has a good post on the state of ARM-powered servers.  Sometimes when I read other people’s work, I think what they say at the end should be moved to the beginning of the conversation.  Here is the last paragraph from James’s post.

We are on track for renewed competition in the server-side computing market segment and intense competition on power efficiency at the same time as internet-scale service operators are willing to run whatever processor is least expensive and most power efficient. With competition comes innovation and I see a good year coming.

James points out the ARM instruction set as an advantage.

ARM has become an incredibly important instruction set architecture powering smartphones, low-end network routers, printers, copiers, tablets, and other embedded applications. But things are changing: ARM is now producing designs appropriate for server-side computing at the same time that power consumption is becoming a key measure of server-side computing cost. The ARM design team are masters of low-power designs and generations of ARMs have focused on power management. ARM has an impressively efficient design.

Here is another little-known fact: there are Virtualization Extensions in the ARM architecture.

Virtualization Extensions

  • The ARM Architecture Virtualization Extension and Large Physical Address Extension (LPAE) enable the efficient implementation of virtual machine hypervisors for ARM architecture compliant processors.
  • Connected consumer devices and cloud computing demand energy efficient, high performance systems to handle complex software with potentially large amounts of data.
  • The ARM Architecture Virtualization Extensions provides the basis for ARM Architecture compliant processors to address the needs of both client and server devices for the partitioning and management of complex software environments into virtual machines.
  • The ARM Architecture Large Physical Address Extension provides the means for each of the software environments to efficiently utilize the available physical memory when handling large amounts of data.

Part of the Virtualization extensions is more than 32 bit virtual memory addressing.

  • As the complexity of software increases the requirement for multiple software environments to be available on the same physical processor increases simultaneously. Software applications that require separation for reasons of isolation, robustness or differing real-time characteristics need a virtual processor exhibiting the required functionality.
  • To provide virtual processors in an energy-efficient manner requires a combination of hardware acceleration and efficient software hypervisors. The ARM Architecture Virtualization Extension standardizes the architecture for implementation of the hardware acceleration in ARM application processor cores, while high performance hypervisors from the world’s leading virtualization companies provide the software component upon which to build effective software combinations.
  • Cloud computing and other data or content oriented solutions increase the demands on the physical memory system from each virtual machine. The large physical address extensions provide a second level of MMU translation table so that each 32-bit virtual memory address can be mapped within a 40-bit physical memory range. This allows systems to allocate sufficient physical memory to each virtual machine for efficient throughput to be maintained when total demands on memory exceed the range of 32-bit addressing.
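The LPAE address-range numbers above are easy to verify with a little arithmetic.  The bit widths come from the quoted text; nothing here is ARM-specific code:

```python
# Address-range arithmetic for the Large Physical Address Extension (LPAE).
virtual_bits = 32   # each VM still sees a 32-bit virtual address space
physical_bits = 40  # second-stage translation maps into a 40-bit physical range

GB = 2 ** 30
vm_address_space_gb = 2 ** virtual_bits // GB  # 4 GB per virtual machine
physical_range_gb = 2 ** physical_bits // GB   # 1024 GB (1 TB) physical range
full_vms = 2 ** (physical_bits - virtual_bits)  # full 4 GB VMs that fit

print(vm_address_space_gb, physical_range_gb, full_vms)  # → 4 1024 256
```

So up to 256 virtual machines, each with a full 4 GB virtual address space, can be backed by distinct physical memory, which is what lets total memory demand exceed 32-bit addressing.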

People will laugh at ARM servers, but when you can get a dozen or more for the price of one Xeon, there are scenarios where ARM servers will work.

Read more