Danger, Yahoo Mail is having the T-Mobile Sidekick experience that sank the service

If you follow the hot topics in technology, it is easy to believe that email is dead.  Many young people have dropped their email accounts because their friends use social media.  I don’t know about you, but email is still part of how I communicate.  Yahoo is finding out how important mail is, with days of outages and no end in sight.

This event could become as big a disaster as the Microsoft Danger/T-Mobile Sidekick outage and data loss, which caused users to drop the service.

October 2009 data loss

In early October 2009, a server malfunction or technician error at Danger's data centers resulted in the loss of all Sidekick user data. As Sidekicks store users' data on Danger's servers—versus using local storage—users lost contact directories, calendars, photos, and all other media not locally backed up. Local backup could be accomplished through an app ($9.99 USD) which synchronized contacts, calendar, and tasks, but not notes, between the web and a local Windows PC. In an October 10 letter to subscribers, Microsoft expressed its doubt that any data would be recovered.[6]

The customer's data that was lost was being hosted in Microsoft's data centers at the time.[7] Some media reports have suggested that Microsoft hired Hitachi to perform an upgrade to its storage area network (SAN), when something went wrong, resulting in data destruction.[8] Microsoft did not have an active backup of the data and it had to be restored from a month-old copy of the server data, totalling 800GB in size, from offsite backup tapes. The entire restoration of data took over 2 months for customer data and full functionality to be restored.[9]

The Danger/Sidekick episode is one in a series of cloud computing mishaps that have raised questions about the reliability of such offerings.[10]

When you look for the cause of a major outage, you will eventually trace it back to operations.  The initial Yahoo Mail outage was caused by a hardware failure.  Marissa Mayer has posted the latest status as of 5 p.m. today.

The initial failure was in a storage system.

On Monday, December 9th at 10:27 p.m. PT, our network operating center alerted the Mail engineering team to a specific hardware outage in one of our storage systems serving 1% of our users. The Mail team immediately started working with the storage engineers to restore access and move to our back-up systems, estimating that full recovery would be complete by 1:30 p.m. PT on Tuesday.

So Yahoo fixes the problem, but restoring service is not simple because users are affected in many different ways.

However, the problem was a particularly rare one, and the resolution for the affected accounts was nuanced since different users were impacted in different ways. Some of the affected users were unable to access their accounts, instead seeing an outdated “scheduled maintenance” page which was a confusing and incorrect message (this has since been corrected and updated). Further, messages sent to those accounts during this time were not delivered, but held in a queue.

Now the service is running, unless you use IMAP.  What is IMAP?  It is the protocol many mobile and desktop mail clients use to access mail, but it is not as simple as POP.

While IMAP remedies many of the shortcomings of POP, this inherently introduces additional complexity. Much of this complexity (e.g. multiple clients accessing the same mailbox at the same time) is compensated for by server-side workarounds such as Maildir or database backends.

The IMAP specification has been criticised for being insufficiently strict and allowing behaviours that effectively negate its usefulness. For instance, the specification states that each message stored on the server has a "unique id" to allow the clients to identify the messages they have already seen between sessions. However, the specification also allows these UIDs to be invalidated with no restrictions, practically defeating their purpose.[13]

Unless the mail storage and searching algorithms on the server are carefully implemented, a client can potentially consume large amounts of server resources when searching massive mailboxes.
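
To make the unique-id point concrete: IMAP pairs the per-message UIDs with a per-mailbox UIDVALIDITY value, and a client is expected to throw away its cached UIDs whenever that value changes. Below is a minimal Python sketch of that client-side check. It does no networking, and `MailboxCache` and `sync` are hypothetical names invented for illustration, not part of any mail library or of Yahoo's system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class MailboxCache:
    """Hypothetical client-side record of what has already been downloaded."""
    uidvalidity: Optional[int] = None
    seen_uids: Set[int] = field(default_factory=set)


def sync(cache: MailboxCache, server_uidvalidity: int, server_uids: List[int]) -> List[int]:
    """Return the UIDs the client still needs to download, in ascending order."""
    if cache.uidvalidity != server_uidvalidity:
        # The server has invalidated its UIDs: nothing cached can be trusted,
        # so the client starts over for this mailbox.
        cache.uidvalidity = server_uidvalidity
        cache.seen_uids.clear()
    new_uids = sorted(set(server_uids) - cache.seen_uids)
    cache.seen_uids.update(new_uids)
    return new_uids


cache = MailboxCache()
first = sync(cache, 1, [1, 2, 3])      # first sync: everything is new
second = sync(cache, 1, [1, 2, 3, 4])  # same UIDVALIDITY: only UID 4 is new
third = sync(cache, 2, [1, 2])         # UIDVALIDITY changed: re-download all
```

If a server changes UIDVALIDITY often, every client ends up re-downloading whole mailboxes, which is the practical cost the criticism above points at.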

Users don’t care about these IMAP details.  Marissa closes her status update with the following.  Will it make the users who still can’t get their mail through IMAP feel better?

Above all else, we’re going to be working hard on improvements to prevent issues like this in the future. While our overall uptime is well above 99.9%, even accounting for this incident, we really let you down this week.

We can, and we will, do better in the future.
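
A rough sketch of the arithmetic behind a figure like 99.9%. The statement does not say how Yahoo measures uptime, so the user-weighted formula below is an assumption, and the 72-hour outage length is a hypothetical input, not a reported number. The 1% figure is the share of users cited in the statement.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of total downtime that still meet the given availability."""
    return (1.0 - availability) * period_hours


def user_weighted_availability(fraction_affected, outage_hours,
                               period_hours=HOURS_PER_YEAR):
    """Availability averaged over all users (assumed method, not Yahoo's stated one)."""
    return 1.0 - fraction_affected * outage_hours / period_hours


# 99.9% over a year leaves a budget of roughly 8.76 hours of downtime.
budget = allowed_downtime_hours(0.999)

# Hypothetical: 1% of users locked out for 72 hours in a year.
overall = user_weighted_availability(0.01, 72)
print(f"budget: {budget:.2f} h, overall availability: {overall:.4%}")
```

Under these assumptions, an outage lasting days for 1% of users barely moves the overall number, which is why an aggregate uptime figure means little to the users who were locked out.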

It’s still not clear what will happen for the users who access their email through IMAP.