Case study of Washington Post using AWS

Jeff Bezos bought The Washington Post, and there is speculation about how the paper will use technology.  Here is a case study, released a while ago, on The Washington Post using AWS.

AWS Case Study: Washington Post

Peter Harkins, a Senior Engineer at The Washington Post, heard the news spread through the editorial department as the National Archives announced the release of Hillary Clinton’s official White House schedule. The data was going to be released to the public on March 19th at 10am: 17,481 pages of data as a non-searchable PDF.

The documents covered Hillary Clinton’s daily activities as First Lady during President Bill Clinton’s two terms in office, from 1993 to 2001. They were being made public under the Freedom of Information Act after multiple requests from journalists and watchdog organizations.

Harkins knew that reporters would be very interested in this data, but it would take hundreds of person-hours to pore through the low-quality PDF files. So, about 45 minutes after the release, Harkins started working with the data, trying to find a way to convert the images into usable, searchable text and deliver them to the newsroom within the same news cycle.

Harkins first tested various PDF and Optical Character Recognition (OCR) tools to convert the images into machine-readable text. With these software tools, he estimated that it would take about 30 minutes per page to process the sizable document, including reformatting, resizing, and scanning each page.

Working against time, Harkins moved the project to the cloud, using Amazon Elastic Compute Cloud (Amazon EC2). With Amazon EC2, he launched 200 server instances to process the images to his specifications. With a processing speed of approximately 60 seconds per page, the project was completed within nine hours and sent to the eager writers, who began searching the data. Then, Harkins and his team created a polished web interface and made their searchable database available to the public 26 hours later.
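To put the speedup in perspective, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes (this is not stated in the case study) that the 60 seconds per page is a per-instance rate and that pages were split evenly across the 200 instances. The function names are made up for illustration.

```python
import math

# Figures quoted in the case study.
PAGES = 17_481
SERIAL_MINUTES_PER_PAGE = 30   # estimate with the desktop PDF/OCR tools
CLOUD_SECONDS_PER_PAGE = 60    # assumed here to be a per-instance rate
INSTANCES = 200


def serial_hours(pages: int, minutes_per_page: float) -> float:
    """Hours needed to process every page one after another."""
    return pages * minutes_per_page / 60


def parallel_hours(pages: int, seconds_per_page: float, instances: int) -> float:
    """Idealized wall-clock hours when pages are split evenly across instances.

    The busiest instance handles ceil(pages / instances) pages, and it
    determines when the whole job finishes.
    """
    pages_on_busiest_instance = math.ceil(pages / instances)
    return pages_on_busiest_instance * seconds_per_page / 3600


serial = serial_hours(PAGES, SERIAL_MINUTES_PER_PAGE)
parallel = parallel_hours(PAGES, CLOUD_SECONDS_PER_PAGE, INSTANCES)

print(f"One machine, 30 min/page: {serial:,.1f} hours (~{serial / 24:.0f} days)")
print(f"200 instances, 60 s/page: {parallel:.2f} hours of pure processing")
```

Under these assumptions the serial estimate comes to roughly a year of continuous processing, while the idealized parallel figure is a lower bound on processing time only. The case study reports nine hours for the whole project and does not break down where the rest of the time went.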

Harkins ruminates, “EC2 made it possible for this project to happen at the speed of breaking news. I used 1,407 hours of virtual machine time for a final expense of $144.62. We consider it a successful proof of concept.”
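The numbers in that quote can be turned into a few unit costs. This is just arithmetic on the quoted figures (1,407 instance-hours, $144.62, 17,481 pages, 200 instances); the function and key names are made up for illustration, and the results are averages, not AWS list prices.

```python
# Figures quoted in the case study.
INSTANCE_HOURS = 1_407
TOTAL_COST_USD = 144.62
PAGES = 17_481
INSTANCES = 200


def unit_costs(total_cost: float, instance_hours: float, pages: int, instances: int) -> dict:
    """Derive average unit costs from the totals quoted in the case study."""
    return {
        "usd_per_instance_hour": total_cost / instance_hours,
        "usd_per_page": total_cost / pages,
        "avg_hours_per_instance": instance_hours / instances,
    }


costs = unit_costs(TOTAL_COST_USD, INSTANCE_HOURS, PAGES, INSTANCES)

print(f"Average cost per instance-hour: ${costs['usd_per_instance_hour']:.3f}")
print(f"Average cost per page:          ${costs['usd_per_page']:.4f}")
print(f"Average hours per instance:     {costs['avg_hours_per_instance']:.2f}")
```

That works out to roughly ten cents per instance-hour and under one cent per page, with each instance running for about seven hours on average.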