CMS releases open data for Machine Learning

Jul 17, 2019

The CMS collaboration at CERN has released its fourth batch of open data to the public. With this release, which brings the volume of its open data to more than 2 PB (or two million GB), CMS has now provided open access to 100% of its research data recorded in proton–proton collisions in 2010, in line with the collaboration’s data-release policy. The release also includes several new data and simulation samples. The new release builds upon and expands the scope of the successful use of CMS open data in research and in education.

In this release, CMS open data address the ever-growing application of machine learning (ML) to challenges in high-energy physics. According to a recent paper, collaboration with the data-science and ML community is considered a high priority to help advance the application of state-of-the-art algorithms in particle physics. CMS has therefore also made available samples that can help foster such collaboration.

“Modern machine learning is having a transformative impact on collider physics, from event reconstruction and detector simulation to searches for new physics,” remarks Jesse Thaler, an Associate Professor at MIT, who is working on ML using CMS open data with two doctoral students, Patrick Komiske and Eric Metodiev. “The performance of machine-learning techniques, however, is directly tied to the quality of the underlying training data. With the extra information provided in the latest data release from CMS, outside users can now investigate novel strategies on fully realistic samples, which will likely lead to exciting advances in collider data analysis.”

The ML datasets, derived from millions of CMS simulation events for previous and future runs of the Large Hadron Collider, focus on solving a number of representative challenges for particle identification, tracking and distinguishing between multiple collisions that occur in each crossing of proton bunches. All the datasets come with extensive documentation on what they contain, how to use them and how to reproduce them with modified content.

In its policy on data preservation and open access, CMS commits to releasing 100% of its analysable data within ten years of collecting them. Around half of the proton–proton collision data collected at a centre-of-mass energy of 7 TeV in 2010 were released in the first CMS release in 2014, and the remaining data are included in this new release. In addition, a small sample of unprocessed raw data from the LHC’s Run 1 (2010 to 2012) has been released. This sample will help test the chain for processing CMS data using the legacy software environment.

Reconstructed data and simulations from the CASTOR calorimeter, which was used by CMS in 2010, are also available and represent the first release of data from the very-forward region of CMS. Finally, CMS has released instructions and examples on how to generate simulated events and how to analyse data in isolated “containers”, within which one has access to the CMS software environment required for specific datasets. It is also now easier to search through the simulated data and to trace the provenance of datasets.

As before, the data are released into the public domain under the Creative Commons CC0 waiver via the CERN Open Data portal. The portal is openly developed by the CERN Information Technology department, in cooperation with the experimental collaborations who release open data on it.

You can also read the release announcement on the CERN Open Data portal and on the CMS website.
