    Loggly suffers extended outage after AWS reboot shuts down their service

    Loggly is a cloud logging service that offers, among other things, systems monitoring and alerting.

    Systems Monitoring & Alerting

    Alerting on log events has never been so easy.  Alert Birds will help you eliminate problems before they start by allowing you to monitor for specific events and errors.  Create a better user experience and improve customer satisfaction through proactive monitoring and troubleshooting. Alert Birds are available to squawk & chirp when things go awry.

    But Loggly suffered an extended outage after AWS rebooted 100% of their servers. The reboot itself accounted for only about half of the downtime; the other half was due to not knowing the service was down.

    Loggly's Outage for December 19th

    Posted 19 Dec, 2011 by Kord Campbell

    Sometimes there's just no other way to say  "we're down" than just admitting you screwed up and are down.  We're coming back up now, and in theory by the time this is read, we'll be serving the app again normally.  There will be a good amount of time until we can rebuild the indexes for historic data of our paid customers. This is our largest outage to date, and I'm not at all proud of it.


    Loggly uses a variety of monitoring mechanisms to ensure our services are healthy.  These include, but are not limited to, extensive monitoring with Nagios, external monitors like Zerigo, and using a slew of our own API calls for monitoring for errors in our logs.  When the mass reboot occurred we failed to alert because a) our monitoring server was rebooted and failed to complete the boot cycle, b) the external monitors were only set to test for pings and established connections to syslog and http (more about that in a moment), and c) the custom API calls using us were no longer running because we were down.

    Combined, these failures effectively prevented us from noticing we were down.  This in and of itself was the cause of at least half our down time, and to me, the most unacceptable part of this whole situation.
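
    The failure mode described above, where the monitors themselves go quiet and nobody is told, is often handled with a heartbeat (or "dead man's switch") that runs somewhere outside the monitored system. The following is a minimal sketch of that idea, not Loggly's actual setup: the class name `HeartbeatWatchdog`, its methods, and the source names in the usage note are all hypothetical, and it assumes the watchdog is hosted on infrastructure that would not be rebooted together with the service.

```python
import time
from typing import Callable, Dict, List


class HeartbeatWatchdog:
    """Tracks when each monitoring source last checked in.

    Hypothetical helper for illustration: a source that stays silent
    longer than max_silence_s is reported, so a monitor that dies
    (or never finishes booting) becomes an alert instead of silence.
    """

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen: Dict[str, float] = {}

    def beat(self, source: str) -> None:
        """Record that a source (e.g. a monitoring server) is alive."""
        self._last_seen[source] = self._clock()

    def silent_sources(self) -> List[str]:
        """Return the sources whose last heartbeat is too old."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._max_silence_s
        )
```

    In use, each monitor would call `beat("nagios")` or similar on a schedule, and a separate job would page someone whenever `silent_sources()` is non-empty. The design point is that silence is treated as a failure rather than as good news.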

    The other half of the outage was caused by Loggly not testing for a 100% reboot of all machines.

    The Human Element

    The other cause to our failures is what some of you on Twitter are calling "a failure to architect for the cloud".  I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes".  A reboot of all boxes has never been tested at Loggly before.  It's a test we've failed completely as of today.  We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.

    One lesson from Loggly's outage, which some of my software colleagues and I are applying in a design of our own, is to use more than one monitoring solution.

    The second step is to ensure more robust external monitoring.  With multiple deployments, this issue becomes less of an issue, but clearly we need more reliable checks than what we rely on with Zerigo or other services.  Sorry, but simple HTTP checks, pings and established connections to a box do not guarantee it's up!
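
    One way to go beyond pings and open connections is an end-to-end check: send a uniquely tagged event through the front door and confirm it can be found again through search. The sketch below illustrates that pattern under stated assumptions. The function `end_to_end_check` and its `send_event` and `search_events` callables are hypothetical placeholders for whatever ingestion and query clients a service exposes; they are not Loggly or Zerigo APIs.

```python
import time
import uuid
from typing import Callable, Iterable


def end_to_end_check(
    send_event: Callable[[str], None],
    search_events: Callable[[str], Iterable[str]],
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only if a freshly sent marker event becomes searchable.

    An accepted connection is not enough: the event must travel through
    ingestion and indexing and come back out of a search.
    """
    marker = f"healthcheck-{uuid.uuid4()}"  # unique per run
    try:
        send_event(marker)
    except Exception:
        return False  # could not even submit the event

    deadline = clock() + timeout_s
    while True:
        try:
            if any(marker in hit for hit in search_events(marker)):
                return True
        except Exception:
            pass  # treat a failing search as "not found yet"
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

    A check like this would have failed during the outage even while the boxes still answered pings, because the marker event would never have shown up in search results.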


