
    15 years ago Google placed its largest server order, and something bigger started: site reliability engineering

    A Google+ post went up on Google placing the largest server order in its history 15 years ago.


    15 years ago we placed the largest server order in our history: 1680 servers, packed into the now infamous "corkboard" racks that packed four small motherboards onto a single tray. (You can see some preserved racks at Google in Building 43, at the Computer History Museum in Mountain View, and at the American Museum of Natural History in DC.)

    At the time of the order, we had a grand total of 112 servers, so 1680 was a huge step.  But by the summer, these racks were running search for millions of users.  In retrospect the design of the racks wasn't optimized for reliability and serviceability, but given that we only had two weeks to design them, and not much money to spend, things worked out fine.

    I read this wondering how impactful this large server order really was, but I couldn’t figure out what to post on why the order is significant.

    Then I ran into this post on Site Reliability Engineering dated Apr 28, 2014, and realized there was a huge impact: Google starting the idea of a site reliability engineering team.


    Here is one of the insights shared.


    The solution that we have in SRE -- and it's worked extremely well -- is an error budget.  An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything.  Perhaps a pacemaker is a good exception!  But, in general, for any software service or system you can think of, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and, let's say, 99.999% available.  Because typically there are so many other things that sit in between the user and the software service that you're running that the marginal difference is lost in the noise of everything else that can go wrong.
    If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system?  I propose that's a product question. It's not a technical question at all.  It's a question of what will the users be happy with, given how much they're paying, whether it's direct or indirect, and what their alternatives are.
    The business or the product must establish what the availability target is for the system. Once you've done that, one minus the availability target is what we call the error budget; if it's 99.99% available, that means that it's 0.01% unavailable.  Now we are allowed to have .01% unavailability and this is a budget.  We can spend it on anything we want, as long as we don't overspend it.  
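    The arithmetic behind an error budget is simple enough to sketch. A minimal example, assuming a 99.99% availability target and a 30-day window (both illustrative choices, not figures from the post):

    ```python
    # Error budget: one minus the availability target, spread over a window.
    # A minimal sketch; the 99.99% target and 30-day window are assumptions
    # chosen for illustration.

    def error_budget(availability_target: float, window_seconds: float) -> float:
        """Return the allowed downtime in seconds within the window."""
        return (1.0 - availability_target) * window_seconds

    THIRTY_DAYS = 30 * 24 * 3600  # seconds in a 30-day window

    budget = error_budget(0.9999, THIRTY_DAYS)
    print(f"99.99% over 30 days allows {budget / 60:.1f} minutes of downtime")
    # -> about 4.3 minutes; that is the budget the team can spend on
    #    risky launches, maintenance, or anything else, as long as they
    #    don't overspend it.
    ```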

    Here is another rule that is good to think about when running operations.

    One of the things we measure in the quarterly service reviews (discussed earlier), is what the environment of the SREs is like. Regardless of what they say, how happy they are, whether they like their development counterparts and so on, the key thing is to actually measure where their time is going. This is important for two reasons. One, because you want to detect as soon as possible when teams have gotten to the point where they're spending most of their time on operations work. You have to stop it at that point and correct it, because every Google service is growing, and, typically, they are all growing faster than the head count is growing. So anything that scales headcount linearly with the size of the service will fail. If you're spending most of your time on operations, that situation does not self-correct! You eventually get the crisis where you're now spending all of your time on operations and it's still not enough, and then the service either goes down or has another major problem.
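    The time-allocation check described in that passage can be sketched as a simple threshold test. The 50% cap on operations work is the commonly cited SRE guideline, assumed here; the team names and hours are invented for illustration:

    ```python
    # Sketch of the quarterly time-allocation review: flag teams whose
    # operational share of time exceeds a cap. The 50% cap is the commonly
    # cited SRE guideline (an assumption here); teams and hours are made up.

    OPS_CAP = 0.5  # maximum fraction of time on operations work

    def ops_fraction(ops_hours: float, total_hours: float) -> float:
        """Fraction of a team's time spent on operations work."""
        return ops_hours / total_hours

    teams = {
        "search-frontend": (320, 800),  # (ops hours, total hours) this quarter
        "storage": (520, 800),
    }

    for name, (ops, total) in teams.items():
        frac = ops_fraction(ops, total)
        if frac > OPS_CAP:
            print(f"{name}: {frac:.0%} on ops -- over the cap, needs correction")
    ```

    The point of measuring rather than asking is that the situation does not self-correct; a team over the cap only gets further over it as the service grows.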


    Building the Best Software Services, can you find the secret guild?

    I have been in the Bay Area for the past two weeks for business meetings before I head back to Redmond.  Actually, I haven’t been here for two weeks straight; I’ve taken two trips.  I’ve lived for 22 years in Redmond, and before that spent 32 years in Silicon Valley.  I go back and forth often enough that I have office space in both locations.  How Silicon Valley works is different from Seattle/Redmond, but there is a common trait: the guys who belong to the secret guild of low-level programmers who can build services that scale and run like the Energizer Bunny.  Working on OS software at Apple and Microsoft got me used to working with the developers who belong to the secret guild.

    What is the secret guild?  Here is a post that tells the story.

    the secret guild of silicon valley

    The governors of the guild of St. Luke, Jan de Bray

    A couple of weeks ago, I was drinking beer in San Francisco with friends when someone quipped:

    "You have too many hipsters, you won’t scale like that. Hire some fat guys who know C++." 

    It’s funny, but it got me thinking.  Who are the “fat guys who know C++”, or as someone else put it, “the guys with neckbeards, who keep Google’s servers running”? And why is it that if you encounter one, it’s like pulling on a thread, and they all seem to know each other?

    The reason is because the top engineers in Silicon Valley, whether they realize it or not, are part of a secret Guild.  They are a confraternity of craftsmen who share a set of traits:


    Read the post to get the rest of the story.

    For those of you too lazy to click on the link, here are the closing paragraphs.

    Finally, the implicit compact that the Guild makes with a company is that their efforts will not be in vain.  The most powerfully attractive force for the Guild is the promise of building a product that will get into the happy hands of hundreds, thousands, or millions.  This is the coveted currency that even companies that have struggled to build an engineering reputation, like foursquare, can offer. 

    The Guild of Silicon Valley is largely invisible, but their affiliations have determined the rise and fall of technology giants.  The start-ups who recognize the unsung talents of its members today will be tomorrow’s success stories.



    Turbine Blade Pressure Causes Bat Trauma, Not Impact

    The Telegraph has a post reporting that bats are dying because of turbine blade pressure, not impact.

    Bats get ‘the Bends’ when they fly too near wind turbines, experts have claimed.

    Queen’s University Belfast said pressure from the turbine blades causes a condition similar to that experienced by divers when they surface too quickly.

    Conservationists have warned that the bodies of bats are frequently seen around the bases of turbines, but it was previously assumed they had flown into the blades.

    However, Dr Richard Holland claims that bats suffer from ‘barotrauma’ when they approach the structures, which can pop their lungs from inside their bodies.

    Dr Holland’s suggested answer is to turn off the turbines while bats are migrating.

    Dr Holland said energy companies should consider turning off turbines when bats are migrating.

    "We know that bats must be 'seeing' the turbines, but it seems that the air pressure patterns around working turbines give the bats what's akin to the bends," he said.

    The effect on wildlife of wind turbines is slowly being discovered.

    Salon reports that offshore wind farms are helping seals find food.

    Go wind power! For once, the green energy source has made the news for the wildlife it doesn’t inadvertently slaughter — and that it may even be helping to thrive. Offshore wind farms, finds a study published today in the journal Current Biology, are making more food available for seals.

    A farm off the coast of Germany, researchers found, is acting as an “artificial reef,” attracting fish and crustaceans and the grey and harbor seals that feed on them.


    Think Google Infrastructure will hit $3bil/Qtr in Q4 2014 or Q1 2015?

    Google’s data center group is on a growth curve that is mind-blowing.  Last quarter was $2.65 bil.


    Note that not all of this spend is from the data center group.

    When you stare at this graph it seems like the $3bil mark is only a few quarters away.
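    A back-of-envelope projection makes the "few quarters away" idea concrete. Starting from the $2.65 bil quarter, and assuming (purely for illustration) a steady 7% quarter-over-quarter growth rate:

    ```python
    # Rough projection of when quarterly capex crosses $3B, using compound
    # growth from the $2.65B quarter in the post. The 7% quarter-over-quarter
    # growth rate is an illustrative assumption, not a reported figure.

    def quarters_to_target(current: float, target: float, growth: float) -> int:
        """Count quarters of compound growth until `current` reaches `target`."""
        quarters = 0
        while current < target:
            current *= 1.0 + growth
            quarters += 1
        return quarters

    q = quarters_to_target(2.65, 3.0, 0.07)
    print(f"Crosses $3B in about {q} quarters")  # -> 2 quarters at 7% growth
    ```

    At that assumed rate, the $3 bil mark lands in roughly two quarters, which is consistent with a Q4 2014 or Q1 2015 guess.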

    If you are a believer that size brings efficiency, then Google is clearly one of the leaders.


    What do you mean there are bogus repairs? Hell yeah

    The WSJ had an article on bogus repairs on railcars at a port complex.

    TERMINAL ISLAND, Calif.—Ten thousand railcars a month roll into this sprawling port complex in Los Angeles County. While here, most are inspected by a subsidiary of Caterpillar Inc.

    When problems are found, the company repairs the railcars and charges the owner. Inspection workers, to hear some tell it, face pressure to produce billable repair work.

    Some workers have resorted to smashing brake parts with hammers, gouging wheels with chisels or using chains to yank handles loose, according to current and former employees.

    In a practice called "green repairs," they added, workers at times have replaced parts that weren't broken and hid the old parts in their cars out of sight of auditors. One employee said he and others sometimes threw parts into the ocean.

    It is a bit ironic that the term “green repairs” is used to describe the practice.  What could be less green (environmentally) than damaging a part to create a repair transaction?

    Even so, they said, car men are under pressure to identify repair work to be done. The quickest way to do so, they said, was to smash something or to remove a bolt or other part and report it as missing.

    They weren't instructed to do that, the workers said. But they added that some managers made clear the workers would be replaced if they didn't produce enough repair revenue.

    "A lot of guys are in fear of losing their jobs because there's no work in California," said one worker, standing in front of his small ranch house a few miles from the Terminal Island ports.

    Car men are expected to justify their hourly pay "and then some," this worker said. "If you find no defects, it's a bad night," he added, and that creates a temptation to "break something that's not broken."

    This is a consequence of having performance-based incentive systems that are short-sighted.