I am sure most of your have heard of Hadoop.
I've started studying Hadoop and its adoption in data centers. Google started the effort with its MapReduce and Google File System.
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
Why should you care about Hadoop? Look at who the users are - Amazon Web Services, Adobe, AOL, Baidu, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Quantcast, Rackspace, Twitter, and Yahoo.
Yahoo! is proud of being the largest Hadoop user. Here is their 2009 #'s 25,000 nodes.
And, 2010 38,000 servers for 170 PB of storage
Apache Pig is a platform for analyzing the large data set.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: