
Building Google's 1 Billion word corpus

Kaggle's version is identical

Google Corpuses

Building the 1 billion word corpus from source

Building Google's 1-billion-word language modelling benchmark from source is much more involved than expected, and the raw data contains a large number of duplicate sentences. Once the duplicates are removed, the number of words in the corpus drops from 2.9G to 0.8G.
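To make the effect of de-duplication concrete, here is a toy sketch on a four-line sample. It assumes duplicates are exact repeated lines and uses `sort -u` to drop them; the file names are made up for illustration, and this is not the benchmark's own build script.

```shell
# Toy illustration only: the sample sentences and file names are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' 'the cat sat' 'a dog ran' 'the cat sat' 'the cat sat' > "$tmp/raw.txt"

# Keep one copy of each distinct line (LC_ALL=C gives plain byte-wise ordering).
LC_ALL=C sort -u "$tmp/raw.txt" > "$tmp/dedup.txt"

# Count words before and after; tr strips any padding that wc adds.
words_before=$(wc -w < "$tmp/raw.txt" | tr -d ' ')
words_after=$(wc -w < "$tmp/dedup.txt" | tr -d ' ')
echo "words before: $words_before, after: $words_after"

rm -r "$tmp"
```

On the real data the same kind of before/after `wc` comparison is what shows the drop in word count.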

On the other hand, the checksums confirmed that the process was going OK, even though the line-by-line aggregate produced midway seemed to have extra double-quotation marks at the beginning of each line.
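As a rough sketch of what a checksum check looks like, the snippet below records an MD5 for a small sample file and then verifies it with GNU `md5sum --check`. The file names `sample.txt` and `sample.md5` are hypothetical; for the real corpus the reference checksums would come from the benchmark's documentation rather than being generated locally.

```shell
# Hypothetical sample file standing in for one of the corpus files.
tmp=$(mktemp -d)
printf 'hello corpus\n' > "$tmp/sample.txt"

# Record the checksum in the "<hash>  <filename>" format that md5sum reads back.
(cd "$tmp" && md5sum sample.txt > sample.md5)

# Verify: --status prints nothing and reports the result via the exit code.
if (cd "$tmp" && md5sum --check --status sample.md5); then
  check_result=ok
else
  check_result=mismatch
fi
echo "checksum check: $check_result"

rm -r "$tmp"
```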

pushd data/0-orig-1-billion

git clone
cd 1-billion-word-language-modeling-benchmark/tar_archives


cd ..
ls -l tar_archives/
#-rw-rw-r--. 1 andrewsm andrewsm 10582262657 Feb 24  2011 training-monolingual.tgz
md5sum tar_archives/training-monolingual.tgz

tar --extract -v --file tar_archives/training-monolingual.tgz --wildcards 'training-monolingual/news.20??.en.shuffled'
md5sum training-monolingual/*

mkdir tmp.tmp

# Now build the corpus files :
TMPDIR=tmp.tmp ./scripts/

#real	37m33.908s
#user	63m49.485s
#sys	0m17.388s

rmdir tmp.tmp

more training-monolingual.tokenized.shuffled/news.en-00001-of-00100
#more heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100  ## ignore
head heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050

cat training-monolingual.tokenized.shuffled/news.en-* > training-monolingual.tokenized.shuffled.all
wc training-monolingual.tokenized.shuffled.all
#  30,301,028  768,646,526 4,147,291,308 training-monolingual.tokenized.shuffled.all

cat heldout-monolingual.tokenized.shuffled/news.en.* > heldout-monolingual.tokenized.shuffled.all
wc heldout-monolingual.tokenized.shuffled.all
#  306,688  7,789,987 42,021,857 heldout-monolingual.tokenized.shuffled.all

rm -r training-monolingual
rm -r training-monolingual.tokenized
rm -r training-monolingual.tokenized.shuffled

rm -r heldout-monolingual.tokenized.shuffled


All that being done, the version produced ends up IDENTICAL (i.e. same md5sum) to the version issued by Kaggle for their 1 Billion Word Imputation competition. I should really submit a PR that points that out to the corpus's GitHub README...
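A comparison of this kind can be sketched as follows. The two files here are small stand-ins with made-up names (`built.txt` for the locally built corpus, `kaggle.txt` for the Kaggle download); on the real data the same commands would be pointed at the actual files.

```shell
# Stand-in files; both names are hypothetical.
tmp=$(mktemp -d)
printf 'same content\n' > "$tmp/built.txt"
cp "$tmp/built.txt" "$tmp/kaggle.txt"

# Reading from stdin keeps the file name out of the output, so only the hash is compared.
sum_built=$(md5sum < "$tmp/built.txt" | cut -d' ' -f1)
sum_kaggle=$(md5sum < "$tmp/kaggle.txt" | cut -d' ' -f1)

if [ "$sum_built" = "$sum_kaggle" ]; then
  verdict=identical
else
  verdict=different
fi
echo "files are $verdict"

rm -r "$tmp"
```

For a byte-for-byte check without hashing, `cmp -s file1 file2` is an alternative that also reports its result through the exit code.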