WIMBD: Corpora
We cover ten different corpora, including text-only corpora (e.g., C4), captions from an image-captioning dataset (LAION-2B-en), and code (The Stack). A high-level description of these datasets, computed using WIMBD, is presented in the summary table below, and we provide some background on each corpus here.
ElasticSearch Access
We have indexed several of the corpora used in this work with ElasticSearch. Due to the nature of ES, we are not able to release its keys publicly, but we can provide individual access keys upon request. Please fill in the following form.
Note that due to legal constraints we are unable to provide access to LAION and The Pile.
Instead, we can provide access to C4, OpenWebText, RedPajama, and Dolma.
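Once you have a key, the indices can be queried with the standard Elasticsearch client. Below is a minimal sketch using the official `elasticsearch` Python package; the endpoint URL, index name, and API key shown here are placeholders (assumptions for illustration), not the actual values, which are provided along with your access key.

```python
from elasticsearch import Elasticsearch

# All values below are placeholders: substitute the endpoint, index name,
# and API key provided with your individual access key.
es = Elasticsearch(
    "https://your-wimbd-endpoint.example.com:9243",  # placeholder cluster URL
    api_key="YOUR_API_KEY",
)

# Count documents in the (placeholder) C4 index that contain an exact phrase.
resp = es.count(
    index="c4",  # placeholder index name
    query={"match_phrase": {"text": "climate change"}},
)
print(resp["count"])
```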
Corpus | Model | Size (GB) | # Documents | # Tokens | max(# Tokens) | min(# Tokens) |
---|---|---|---|---|---|---|
OpenWebText | GPT-2* | 41.2 | 8,005,939 | 7,767,705,349 | 95,139 | 128 |
C4 | T5 | 838.7 | 364,868,892 | 153,607,833,664 | 101,898 | 5 |
mC4-en | umT5 | 14,694.0 | 3,928,733,374 | 2,703,077,876,916 | 181,949 | 1 |
OSCAR | BLOOM* | 3,327.3 | 431,584,362 | 475,992,028,559 | 1,048,409 | 1 |
The Pile | GPT-J/Neo & Pythia | 1,369.0 | 210,607,728 | 285,794,281,816 | 28,121,329 | 0 |
RedPajama | LLaMA* | 5,602.0 | 930,453,833 | 1,023,865,191,958 | 28,121,329 | 0 |
S2ORC | SciBERT* | 692.7 | 11,241,499 | 59,863,121,791 | 376,681 | 1 |
peS2o | - | 504.3 | 8,242,162 | 44,024,690,229 | 97,043 | 154 |
LAION-2B-en | Stable Diffusion* | 570.2 | 2,319,907,827 | 29,643,340,153 | 131,077 | 0 |
The Stack | StarCoder* | 7,830.8 | 544,750,672 | 1,525,618,728,620 | 26,298,134 | 0 |
OpenWebText
OpenWebText is an open-source reproduction of the data used to train GPT-2. Because the GPT-2 paper provides only limited information about its training data (WebText), and the data was never released, it is unclear how similar OpenWebText is to the original; however, processing steps similar to those reported in the paper were applied (such as deduplication, non-English filtering, and minimum-length filtering).
C4
C4 is the dataset used by Raffel et al., 2020 for training T5. The dataset, the Colossal Clean Crawled Corpus (C4 for short), is based on Common Crawl as a source of text scraped from the web. As such, much of the data is noisy, and a set of heuristics was employed to clean it up, such as filtering documents by length, removing documents containing obscene/bad words, deduplicating texts, and discarding non-English documents. C4 was not released by Raffel et al.; instead, it was scraped, cleaned, filtered, and released by Dodge et al., 2021.
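To make the cleanup concrete, here is an illustrative sketch of this style of document filtering. The thresholds, the bad-word list, and the language-probability input are assumptions for illustration, not the exact rules or code used to build C4.

```python
# Illustrative C4-style cleaning heuristics (assumed values, not the real pipeline).
BAD_WORDS = {"badword1", "badword2"}  # placeholder for the actual blocklist
MIN_WORDS = 5                         # assumed minimum document length, in words
ENGLISH_THRESHOLD = 0.99              # assumed language-classifier threshold

def keep_document(text: str, english_prob: float, seen_hashes: set[int]) -> bool:
    """Return True if the document passes the length, bad-word,
    language, and exact-duplicate filters."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return False                          # too short
    if any(w in BAD_WORDS for w in words):
        return False                          # contains obscene/bad words
    if english_prob < ENGLISH_THRESHOLD:
        return False                          # likely not English
    h = hash(text)
    if h in seen_hashes:
        return False                          # duplicate text
    seen_hashes.add(h)
    return True
```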
mC4-en
mC4-en is the English portion of mC4, the multilingual version of C4 that was used to train mT5 (Xue et al., 2021) and later umT5 (Chung et al., 2023). We use the latest version (v3.1.0), which was used to train umT5 and contains documents collected from Common Crawl through August 2022; in practice, we use the portion of the data classified as English. The main differences of this version over the one used to train mT5 are a higher language-classifier confidence threshold (raised from 0.7 to 0.96), allowing a random 0.1% of documents containing "bad words" to pass through, and an adaptation of the "bad words" list for languages where it resulted in filtering more than 10% of the documents.
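The sketch below illustrates the filtering change just described. It is a simplification: the function signature and its inputs are assumptions, not the original mC4 pipeline.

```python
import random

LANG_CONFIDENCE_THRESHOLD = 0.96  # raised from 0.7 in the earlier mC4 version
BAD_WORD_PASS_RATE = 0.001        # 0.1% of "bad word" documents are retained

def keep_mc4_en_document(english_confidence: float, contains_bad_word: bool) -> bool:
    """Keep English documents above the confidence threshold; let a small
    random fraction of documents containing "bad words" pass through."""
    if english_confidence < LANG_CONFIDENCE_THRESHOLD:
        return False
    if contains_bad_word:
        return random.random() < BAD_WORD_PASS_RATE
    return True
```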
OSCAR
OSCAR is a multilingual corpus based on Common Crawl. To improve data quality, it applies a length filter that removes documents with short sentences. The data is also annotated with different labels, such as language identification and adult content, which are used for different analyses. It is an ongoing effort, and the corpus is maintained and updated regularly.
The Pile
The Pile is a corpus consisting of 22 different domains. Unlike C4, the data was not scraped from the web and then filtered, but pre-selected, with the motivation that curated sources yield higher-quality data. The included domains are diverse: they include data such as Wikipedia, GitHub, ArXiv, EuroParl, and more. By design, most constituent datasets are upsampled with the aim of increasing data quality, from 1.5x for domains such as OpenSubtitles up to 3x for Wikipedia. Models such as GPT-J, GPT-Neo, and Pythia were trained on this dataset.
As of October 2023 (perhaps even earlier), The Pile is no longer available for download.
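As a rough illustration of what per-domain upsampling means in practice, here is a minimal sketch. Only the Wikipedia (3x) and OpenSubtitles (1.5x) factors come from the description above; the placeholder domain and the sampling scheme itself are simplifying assumptions, not the procedure actually used to assemble The Pile.

```python
import random

UPSAMPLING = {
    "Wikipedia": 3.0,       # upsampled 3x
    "OpenSubtitles": 1.5,   # upsampled 1.5x
    "SomeOtherDomain": 1.0, # placeholder: no upsampling
}

def build_mixture(domains: dict[str, list[str]]) -> list[str]:
    """Repeat each domain's documents according to its upsampling factor."""
    mixture: list[str] = []
    for name, docs in domains.items():
        factor = UPSAMPLING.get(name, 1.0)
        whole, frac = int(factor), factor - int(factor)
        mixture.extend(docs * whole)                              # full copies
        mixture.extend(random.sample(docs, int(len(docs) * frac)))  # fractional part
    random.shuffle(mixture)
    return mixture
```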
RedPajama
RedPajama is an open-source reproduction of the data used to train LLaMA, and was used to train RedPajama-INCITE.
S2ORC
S2ORC is a large corpus of English academic papers, consisting of abstracts and full text, including figures, tables, and references. The texts are automatically extracted from PDFs and LaTeX sources.
peS2o
peS2o is a derivative of S2ORC, cleaned and filtered to obtain a more usable version of the data for training language models. We use peS2o V2.
LAION-2B-en
LAION is a large dataset of images and captions scraped from Common Crawl. The main dataset (LAION-5B) contains 5.8 billion examples, of which 2.32 billion have English captions (LAION-2B-en); we use the latter in this work. We focus on the text captions, but demonstrate qualitative examples using the associated URLs and images when appropriate.
The Stack
The Stack is a source-code dataset that was collected for training language models, and parts of it were used to train SantaCoder and MPT. It was compiled from GHArchive and filtered to remove files that are unlikely to be useful for training code models, such as binary files, files larger than 1MB, and certain file extensions. In addition, only repositories with permissive licenses were included (18 license types in v1.0 and 193 in v1.1); we use v1.2. While the main purpose of code is to provide machine instructions for performing different functionalities, it also contains natural language in the form of comments: "Roughly 40 natural languages are present in docstrings and comments with English being the most prevalent. In python files, it makes up ~96% of the dataset."
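Below is an illustrative sketch of the kind of file-level filtering described above. The extension and license sets are small placeholder examples (assumptions), not the actual lists used to build The Stack.

```python
MAX_FILE_SIZE = 1 * 1024 * 1024  # drop files larger than 1 MB
EXCLUDED_EXTENSIONS = {".exe", ".png", ".zip"}                # placeholder examples
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}   # placeholder subset

def keep_file(path: str, size_bytes: int, is_binary: bool, repo_license: str) -> bool:
    """Keep non-binary, small, non-excluded files from permissively licensed repos."""
    if is_binary or size_bytes > MAX_FILE_SIZE:
        return False
    if any(path.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
        return False
    return repo_license.lower() in PERMISSIVE_LICENSES
```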