WIMBD: Corpora
We cover ten different corpora, including text-only corpora (e.g., C4), captions from an image-captioning dataset (LAION-2B-en), and code (The Stack). A high-level description of these datasets, computed using WIMBD, is presented in the summary table below, and we provide some background on each corpus here.
ElasticSearch Access
We have indexed several of the corpora used in this work with ElasticSearch. Due to the nature of ES, we are not able to release its keys publicly, but we can provide individual access keys upon request. Please fill in the following form.
Note that due to legal constraints we are unable to provide access to LAION and The Pile.
Instead, we can provide access to C4, OpenWebText, RedPajama, and Dolma.
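Once you have a key, the indices can be queried with the standard Elasticsearch client. Below is a minimal sketch using the official `elasticsearch` Python package; the endpoint URL, index name, and API key shown here are placeholders (assumptions for illustration), not the actual values, which are provided along with your access key.

```python
from elasticsearch import Elasticsearch

# All values below are placeholders: substitute the endpoint, index name,
# and API key provided with your individual access key.
es = Elasticsearch(
    "https://your-wimbd-endpoint.example.com:9243",  # placeholder cluster URL
    api_key="YOUR_API_KEY",
)

# Count documents in the (placeholder) C4 index that contain an exact phrase.
resp = es.count(
    index="c4",  # placeholder index name
    query={"match_phrase": {"text": "climate change"}},
)
print(resp["count"])
```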
Corpus | Model | Size (GB) | # Documents | # Tokens | max(# Tokens) | min(# Tokens) |
---|---|---|---|---|---|---|
OpenWebText | GPT-2* | 41.2 | 8,005,939 | 7,767,705,349 | 95,139 | 128 |
C4 | T5 | 838.7 | 364,868,892 | 153,607,833,664 | 101,898 | 5 |
mC4-en | umT5 | 14,694.0 | 3,928,733,374 | 2,703,077,876,916 | 181,949 | 1 |
OSCAR | BLOOM* | 3,327.3 | 431,584,362 | 475,992,028,559 | 1,048,409 | 1 |
The Pile | GPT-J/Neo & Pythia | 1,369.0 | 210,607,728 | 285,794,281,816 | 28,121,329 | 0 |
RedPajama | LLaMA* | 5,602.0 | 930,453,833 | 1,023,865,191,958 | 28,121,329 | 0 |
S2ORC | SciBERT* | 692.7 | 11,241,499 | 59,863,121,791 | 376,681 | 1 |
peS2o | - | 504.3 | 8,242,162 | 44,024,690,229 | 97,043 | 154 |
LAION-2B-en | Stable Diffusion* | 570.2 | 2,319,907,827 | 29,643,340,153 | 131,077 | 0 |
The Stack | StarCoder* | 7,830.8 | 544,750,672 | 1,525,618,728,620 | 26,298,134 | 0 |
OpenWebText
OpenWebText is an open-source reproduction of the data used to train GPT-2. Because the GPT-2 paper provides only limited information about its training data (WebText), and the data was never released, it is unclear how similar OpenWebText is to the original; however, processing steps similar to those reported in the paper were applied (such as deduplication, non-English filtering, and minimum-length filtering).
C4
C4 is the dataset used by Raffel et al., 2020 for training T5. The dataset, the Colossal Clean Crawled Corpus (C4 for short), is based on Common Crawl as a source of text scraped from the web. As such, much of the data is noisy, and a set of heuristics was employed to clean it up, such as filtering documents by length, removing documents containing obscene/bad words, deduplicating texts, and discarding non-English documents. C4 was not released by Raffel et al.; instead, it was scraped, cleaned, filtered, and released by Dodge et al., 2021.
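To make the cleanup concrete, here is an illustrative sketch of this style of document filtering. The thresholds, the bad-word list, and the language-probability input are assumptions for illustration, not the exact rules or code used to build C4.

```python
# Illustrative C4-style cleaning heuristics (assumed values, not the real pipeline).
BAD_WORDS = {"badword1", "badword2"}  # placeholder for the actual blocklist
MIN_WORDS = 5                         # assumed minimum document length, in words
ENGLISH_THRESHOLD = 0.99              # assumed language-classifier threshold

def keep_document(text: str, english_prob: float, seen_hashes: set[int]) -> bool:
    """Return True if the document passes the length, bad-word,
    language, and exact-duplicate filters."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return False                          # too short
    if any(w in BAD_WORDS for w in words):
        return False                          # contains obscene/bad words
    if english_prob < ENGLISH_THRESHOLD:
        return False                          # likely not English
    h = hash(text)
    if h in seen_hashes:
        return False                          # duplicate text
    seen_hashes.add(h)
    return True
```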
mC4-en
mC4-en is the English portion of mC4, the multilingual version of C4 that was used to train mT5 (Xue et al., 2021) and later umT5 (Chung et al., 2023). We use the latest version (v3.1.0), which was used to train umT5 and contains documents collected from Common Crawl through August 2022; in practice, we use the portion of the data classified as English. The main differences of this version over the one used to train mT5 are a higher language-classifier confidence threshold (raised from 0.7 to 0.96), allowing a random 0.1% of documents containing "bad words" to pass through, and an adaptation of the "bad words" list for languages where it resulted in filtering more than 10% of the documents.
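The sketch below illustrates the filtering change just described. It is a simplification: the function signature and its inputs are assumptions, not the original mC4 pipeline.

```python
import random

LANG_CONFIDENCE_THRESHOLD = 0.96  # raised from 0.7 in the earlier mC4 version
BAD_WORD_PASS_RATE = 0.001        # 0.1% of "bad word" documents are retained

def keep_mc4_en_document(english_confidence: float, contains_bad_word: bool) -> bool:
    """Keep English documents above the confidence threshold; let a small
    random fraction of documents containing "bad words" pass through."""
    if english_confidence < LANG_CONFIDENCE_THRESHOLD:
        return False
    if contains_bad_word:
        return random.random() < BAD_WORD_PASS_RATE
    return True
```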
OSCAR
OSCAR is a multilingual corpus based on Common Crawl. To improve data quality, it applies a length filter that removes documents with short sentences. The data is also annotated with different labels, such as language identification and adult content, which are used for different analyses. It is an ongoing effort, and the corpus is maintained and updated regularly.
The Pile
The Pile is a corpus consisting of 22 different domains. Unlike C4, the data was not scraped from the web and then filtered, but pre-selected, with the motivation that curated sources yield higher-quality data. The included domains are diverse: they include data such as Wikipedia, GitHub, ArXiv, EuroParl, and more. By design, most constituent datasets are upsampled with the aim of increasing data quality, from 1.5x for domains such as OpenSubtitles up to 3x for Wikipedia. Models such as GPT-J, GPT-Neo, and Pythia were trained on this dataset.
As of October 2023 (perhaps even earlier), The Pile is no longer available for download.
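As a rough illustration of what per-domain upsampling means in practice, here is a minimal sketch. Only the Wikipedia (3x) and OpenSubtitles (1.5x) factors come from the description above; the placeholder domain and the sampling scheme itself are simplifying assumptions, not the procedure actually used to assemble The Pile.

```python
import random

UPSAMPLING = {
    "Wikipedia": 3.0,       # upsampled 3x
    "OpenSubtitles": 1.5,   # upsampled 1.5x
    "SomeOtherDomain": 1.0, # placeholder: no upsampling
}

def build_mixture(domains: dict[str, list[str]]) -> list[str]:
    """Repeat each domain's documents according to its upsampling factor."""
    mixture: list[str] = []
    for name, docs in domains.items():
        factor = UPSAMPLING.get(name, 1.0)
        whole, frac = int(factor), factor - int(factor)
        mixture.extend(docs * whole)                              # full copies
        mixture.extend(random.sample(docs, int(len(docs) * frac)))  # fractional part
    random.shuffle(mixture)
    return mixture
```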
RedPajama
RedPajama is an open-source reproduction of the data used to train LLaMA, and was used to train RedPajama-INCITE.
S2ORC
S2ORC is a large corpus of English academic papers, consisting of abstracts and full text, including figures, tables, and references. The texts are automatically extracted from PDFs and LaTeX sources.
peS2o
peS2o is a derivative of S2ORC, cleaned and filtered to obtain a more usable version of the data for training language models. We use peS2o V2.
LAION-2B-en
LAION is a large dataset of images and captions scraped from Common Crawl. The main dataset (LAION-5B) contains 5.8 billion examples, of which 2.32 billion have English captions (LAION-2B-en); we use the latter in this work. We focus on the text captions, but demonstrate qualitative examples using the associated URLs and images when appropriate.
The Stack
The Stack is a source-code dataset that was collected for training language models, and parts of it were used to train SantaCoder and MPT. It was compiled from GHArchive and filtered to remove files that are unlikely to be useful for training code models, such as binary files, files larger than 1MB, and certain file extensions. In addition, only repositories with permissive licenses were included (18 license types in v1.0 and 193 in v1.1); we use v1.2. While the main purpose of code is to provide machine instructions for performing different functionalities, it also contains natural language in the form of comments: "Roughly 40 natural languages are present in docstrings and comments with English being the most prevalent. In python files, it makes up ~96% of the dataset."
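Below is an illustrative sketch of the kind of file-level filtering described above. The extension and license sets are small placeholder examples (assumptions), not the actual lists used to build The Stack.

```python
MAX_FILE_SIZE = 1 * 1024 * 1024  # drop files larger than 1 MB
EXCLUDED_EXTENSIONS = {".exe", ".png", ".zip"}                # placeholder examples
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}   # placeholder subset

def keep_file(path: str, size_bytes: int, is_binary: bool, repo_license: str) -> bool:
    """Keep non-binary, small, non-excluded files from permissively licensed repos."""
    if is_binary or size_bytes > MAX_FILE_SIZE:
        return False
    if any(path.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
        return False
    return repo_license.lower() in PERMISSIVE_LICENSES
```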