Google had a crazy idea a year ago: share some of its cluster data with the research community. In January 2010, Google shared 7 hours of data.
Google faces a large number of technical challenges in the evolution of its applications and infrastructure. In particular, as we increase the size of our compute clusters and scale the work that they process, many issues arise in how to schedule the diversity of work that runs on Google systems.
We have distilled these challenges into the following research topics that we feel are interesting to the academic community and important to Google:
- Workload characterizations: How can we characterize Google workloads in a way that readily generates synthetic work that is representative of production workloads, so that we can run stand-alone benchmarks?
- Predictive models of workload characteristics: What is normal and what is abnormal workload? Are there "signals" that can indicate problems in a time-frame that is possible for automated and/or manual responses?
- New algorithms for machine assignment: How can we assign tasks to machines so that we make best use of machine resources, avoid excess resource contention on machines, and manage power efficiently?
- Scalable management of cell work: How should we design the future cell management system to efficiently visualize work in cells, to aid in problem determination, and to provide automation of management tasks?
Now Google has shared 29 days of data from about 11,000 servers in a Google cluster.
Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.
In support of this we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (research blog on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.
Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:
- the original resource requests, to permit scheduling experiments
- request constraints and machine attributes
- machine availability and failure events
- some of the reasons for task exits
- (obfuscated) job and job-submitter names, to help identify repeated or related jobs
- more types of usage information
- CPI (cycles per instruction) and memory traffic for some of the machines
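To give a feel for the kind of scheduling experiment that resource requests, constraints, and machine attributes make possible, here is a minimal sketch. It is not Google's scheduler and does not use the trace's actual schema: the `Machine` and `Task` classes, their field names, and the sample values are all hypothetical, and first-fit is just a simple baseline chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    # Hypothetical fields; the real trace format is not assumed here.
    name: str
    cpu: float                      # free CPU capacity (normalized units)
    mem: float                      # free memory capacity (normalized units)
    attrs: dict = field(default_factory=dict)


@dataclass
class Task:
    # Hypothetical fields; the real trace format is not assumed here.
    name: str
    cpu_request: float
    mem_request: float
    constraints: dict = field(default_factory=dict)  # attribute -> required value


def first_fit(tasks, machines):
    """Place each task on the first machine that meets its constraints and has room.

    Returns a dict mapping task name to machine name, or None if nothing fits.
    Machine capacities are reduced in place as tasks are placed.
    """
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for m in machines:
            attrs_ok = all(m.attrs.get(k) == v for k, v in task.constraints.items())
            if attrs_ok and m.cpu >= task.cpu_request and m.mem >= task.mem_request:
                m.cpu -= task.cpu_request
                m.mem -= task.mem_request
                placement[task.name] = m.name
                break
    return placement


# Made-up sample data, for illustration only.
machines = [
    Machine("m1", cpu=1.0, mem=1.0, attrs={"arch": "x"}),
    Machine("m2", cpu=1.0, mem=1.0, attrs={"arch": "y"}),
]
tasks = [
    Task("t1", 0.6, 0.3),
    Task("t2", 0.6, 0.3),
    Task("t3", 0.5, 0.2, constraints={"arch": "x"}),
    Task("t4", 0.3, 0.2, constraints={"arch": "x"}),
]

placement = first_fit(tasks, machines)
unplaced = [name for name, machine in placement.items() if machine is None]
print(placement)
print("unplaced:", unplaced)
```

In this toy run, `t3` cannot be placed even though `m2` has room, because its constraint limits it to a machine that is already too full. Replaying real requests against real machine attributes would let a researcher measure how often that happens under different placement policies.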
Besides the feedback from the research community, this is a great way for Google to find future hires.