About WIMBD

This is the project page website for the "What's In My Big Data?" (WIMBD) project. You can learn more about it in our paper.

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). What's In My Big Data? (WIMBD) is a platform and a set of 16 high-level analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities---count and search---at scale, allowing us to analyze more than 35 terabytes on a standard compute node.
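To give a flavor of what the "count" capability looks like, here is a minimal, single-process Python sketch of two such analyses: n-gram counting and exact-duplicate detection via document hashing. This is an illustrative toy, not the actual WIMBD implementation, which is built to run these aggregations over terabytes of data; the function names and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter
from hashlib import sha256

def count_ngrams(documents, n=2):
    """Count n-gram occurrences across a corpus (simple whitespace tokenization)."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def count_exact_duplicates(documents):
    """Count documents that share an identical text hash (exact duplicates)."""
    hashes = Counter(sha256(doc.encode("utf-8")).hexdigest() for doc in documents)
    return sum(c for c in hashes.values() if c > 1)

# Toy corpus: two identical documents and one unique one.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "a dog ran in the park",
]
print(count_ngrams(corpus, n=2).most_common(3))
print(count_exact_duplicates(corpus))
```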

We apply WIMBD to 10 different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. We open-source WIMBD code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them.
Our goal is to build better scientific practices around data and use WIMBD to inform data decisions to clean and filter large-scale datasets, as well as to document existing ones.

WIMBD was developed by researchers from the Allen Institute for AI, University of Washington, University of California, Berkeley, and University of California, Irvine.



Cite Us

@inproceedings{elazar2023s,
    title={What's In My Big Data?},
    author={Elazar, Yanai and Bhagia, Akshita and Magnusson, Ian Helgi and Ravichander, Abhilasha and Schwenk, Dustin and Suhr, Alane and Walsh, Evan Pete and Groeneveld, Dirk and Soldaini, Luca and Singh, Sameer and Hajishirzi, Hanna and Smith, Noah A. and Dodge, Jesse},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024}
}