torchnlp.datasets package

The torchnlp.datasets package introduces modules capable of downloading, caching and loading commonly used NLP datasets.

Each module returns a torch.utils.data.Dataset object, i.e., one that implements the __getitem__ and __len__ methods. Hence, any of them can be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers.
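
The protocol described above can be sketched without any dependencies. The class and helper below are hypothetical stand-ins (they are not part of torchnlp or torch); they only illustrate what implementing __getitem__ and __len__ means and how a loader can draw batches from such an object.

```python
class ListDataset:
    """Hypothetical map-style dataset: indexable and sized."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, index):
        return self.rows[index]

    def __len__(self):
        return len(self.rows)


def iter_batches(dataset, batch_size):
    """Yield consecutive batches using only len() and integer indexing."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


dataset = ListDataset([{'source': '0', 'target': '0'}] * 5)
batches = list(iter_batches(dataset, batch_size=2))
```

With PyTorch installed, an object of this shape could instead be handed to torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=2), which adds shuffling, collation and worker processes.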

torchnlp.datasets.wmt_dataset(directory='data/wmt16_en_de', train=False, dev=False, test=False, train_filename='train.tok.clean.bpe.32000', dev_filename='newstest2013.tok.bpe.32000', test_filename='newstest2014.tok.bpe.32000', check_files=['train.tok.clean.bpe.32000.en'], url='https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8')[source]

The Workshop on Machine Translation (WMT) 2014 English-German dataset.

Initially this dataset was preprocessed by Google Brain. Though this download contains test sets from 2015 and 2016, the train set differs slightly from WMT 2015 and 2016 and significantly from WMT 2017.

The provided data is mainly taken from version 7 of the Europarl corpus, which is freely available. Note that this is the same data as last year, since Europarl is no longer translated across all 23 official European languages. Additional training data is taken from the new News Commentary corpus. There are about 50 million words of training data per language from the Europarl corpus and 3 million words from the News Commentary corpus.

A new data resource from 2013 is the Common Crawl corpus, which was collected from web sources. Each parallel corpus comes with an annotation file that gives the source of each sentence pair.

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the dev split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the dev split.
  • test_filename (str, optional) – The filename of the test split.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset tar.gz file.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable
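
The return convention can be illustrated with a small stand-in. The function name select_splits and its placeholder rows are hypothetical; this is a sketch of the documented behaviour (one split comes back bare, several come back as a tuple), not the library's implementation.

```python
def select_splits(train=False, dev=False, test=False):
    """Hypothetical helper mirroring the documented return convention."""
    # Placeholder rows; a real loader would read each split from disk.
    available = [
        (train, ['train example']),
        (dev, ['dev example']),
        (test, ['test example']),
    ]
    selected = [rows for requested, rows in available if requested]
    # A single requested split is returned as-is; several are returned as a tuple.
    if len(selected) == 1:
        return selected[0]
    return tuple(selected)


train_only = select_splits(train=True)
train_split, test_split = select_splits(train=True, test=True)
```

The practical consequence is that tuple unpacking is only appropriate when more than one flag is set to True.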

Example

>>> from torchnlp.datasets import wmt_dataset  # doctest: +SKIP
>>> train = wmt_dataset(train=True)  # doctest: +SKIP
>>> train[:2]  # doctest: +SKIP
[{
  'en': 'Res@@ um@@ ption of the session',
  'de': 'Wiederaufnahme der Sitzungsperiode'
}, {
  'en': 'I declare resumed the session of the European Parliament ad@@ jour@@ ned on...',
  'de': 'Ich erklär@@ e die am Freitag , dem 17. Dezember unterbro@@ ch@@ ene...'
}]
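
The `@@` markers in the output above come from byte-pair encoding (the filenames contain bpe.32000). A minimal sketch for turning such text back into plain words, assuming the common convention that a trailing `@@` means the token continues into the next one; undo_bpe is a hypothetical helper, not a torchnlp function.

```python
def undo_bpe(text, marker='@@ '):
    """Join subword pieces by deleting the continuation marker."""
    return text.replace(marker, '')


english = undo_bpe('Res@@ um@@ ption of the session')
german = undo_bpe('Ich erklär@@ e die am Freitag')
```
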
torchnlp.datasets.iwslt_dataset(directory='data/iwslt/', train=False, dev=False, test=False, language_extensions=['en', 'de'], train_filename='{source}-{target}/train.{source}-{target}.{lang}', dev_filename='{source}-{target}/IWSLT16.TED.tst2013.{source}-{target}.{lang}', test_filename='{source}-{target}/IWSLT16.TED.tst2014.{source}-{target}.{lang}', check_files=['{source}-{target}/train.tags.{source}-{target}.{source}'], url='https://wit3.fbk.eu/archive/2016-01/texts/{source}/{target}/{source}-{target}.tgz')[source]

Load the International Workshop on Spoken Language Translation (IWSLT) 2017 translation dataset.

In-domain training, development and evaluation sets were supplied through the website of the WIT3 project, while out-of-domain training data were linked in the workshop’s website. With respect to edition 2016 of the evaluation campaign, some of the talks added to the TED repository during the last year have been used to define the evaluation sets (tst2017), while the remaining new talks have been included in the training sets.

The English data that participants were asked to recognize and translate consists in part of TED talks, as in previous years, and in part of real-life lectures and talks that were mainly recorded in lecture halls at KIT and Carnegie Mellon University. TED talks are challenging because of their variety of topics, but are comparatively benign because they are thoroughly rehearsed and planned, which makes their language easy to recognize and translate.

Note

The order in which examples are returned is not guaranteed, because the files are discovered with iglob.

References

Citation: M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proc. of EAMT, pp. 261-268, Trento, Italy.

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the dev split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • language_extensions (list of str) – Two language extensions ['en'|'de'|'it'|'nl'|'ro'] to load.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the dev split.
  • test_filename (str, optional) – The filename of the test split.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset file.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import iwslt_dataset  # doctest: +SKIP
>>> train = iwslt_dataset(train=True)  # doctest: +SKIP
>>> train[:2]  # doctest: +SKIP
[{
  'en': "David Gallo: This is Bill Lange. I'm Dave Gallo.",
  'de': 'David Gallo: Das ist Bill Lange. Ich bin Dave Gallo.'
}, {
  'en': "And we're going to tell you some stories from the sea here in video.",
  'de': 'Wir werden Ihnen einige Geschichten über das Meer in Videoform erzählen.'
}]
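
The filename defaults in the signature are templates with {source}, {target} and {lang} placeholders. The sketch below shows how such a template expands with str.format; it assumes (this is not confirmed by the text above) that the first entry of language_extensions is treated as the source and the second as the target.

```python
train_filename = '{source}-{target}/train.{source}-{target}.{lang}'
language_extensions = ['en', 'de']

# Assumption: first extension is the source language, second is the target.
source, target = language_extensions

paths = [
    train_filename.format(source=source, target=target, lang=lang)
    for lang in language_extensions
]
```
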
torchnlp.datasets.multi30k_dataset(directory='data/multi30k/', train=False, dev=False, test=False, train_filename='train', dev_filename='val', test_filename='test', check_files=['train.de', 'val.de'], urls=['http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz', 'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz', 'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz'])[source]

Load the Multi30k dataset from the WMT 2016 multimodal machine translation task.

As a translation task, this task consists in translating English sentences that describe an image into German, given the English sentence itself. As training and development data, we provide 29,000 and 1,014 triples respectively, each containing an English source sentence, its German human translation. As test data, we provide a new set of 1,000 tuples containing an English description.

Status:
Host www.quest.dcs.shef.ac.uk forgot to update their SSL certificate; therefore, this dataset does not download securely.

References

Citation

@article{elliott-EtAl:2016:VL16,
    author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
    title     = {Multi30K: Multilingual English-German Image Descriptions},
    booktitle = {Proceedings of the 5th Workshop on Vision and Language},
    year      = {2016},
    pages     = {70--74}
}
Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the dev split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the dev split.
  • test_filename (str, optional) – The filename of the test split.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • urls (list of str, optional) – URLs to download.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import multi30k_dataset  # doctest: +SKIP
>>> train = multi30k_dataset(train=True)  # doctest: +SKIP
>>> train[:2]  # doctest: +SKIP
[{
  'en': 'Two young, White males are outside near many bushes.',
  'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'
}, {
  'en': 'Several men in hard hats are operating a giant pulley system.',
  'de': 'Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.'
}]
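
Rows of this shape can be built by reading two line-aligned files in parallel. The sketch below is an assumption about the on-disk layout (one file per language, suggested by check_files entries such as train.de); pair_lines is a hypothetical helper, and in-memory buffers stand in for the real files.

```python
import io


def pair_lines(en_file, de_file):
    """Zip two line-aligned text streams into {'en': ..., 'de': ...} rows."""
    return [
        {'en': en.strip(), 'de': de.strip()}
        for en, de in zip(en_file, de_file)
    ]


# In-memory stand-ins for hypothetical train.en and train.de files.
en_file = io.StringIO('Two young, White males are outside near many bushes.\n')
de_file = io.StringIO('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n')
pairs = pair_lines(en_file, de_file)
```
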
torchnlp.datasets.snli_dataset(directory='data/', train=False, dev=False, test=False, train_filename='snli_1.0_train.jsonl', dev_filename='snli_1.0_dev.jsonl', test_filename='snli_1.0_test.jsonl', extracted_name='snli_1.0', check_files=['snli_1.0/snli_1.0_train.jsonl'], url='http://nlp.stanford.edu/projects/snli/snli_1.0.zip')[source]

Load the Stanford Natural Language Inference (SNLI) dataset.

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.

Reference: https://nlp.stanford.edu/projects/snli/

Citation: Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • test_filename (str, optional) – The filename of the test split.
  • extracted_name (str, optional) – Name of the extracted dataset directory.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset zip file.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import snli_dataset  # doctest: +SKIP
>>> train = snli_dataset(train=True)  # doctest: +SKIP
>>> train[0]  # doctest: +SKIP
{
  'premise': 'Kids are on a amusement ride.',
  'hypothesis': 'A car is broke down on the side of the road.',
  'label': 'contradiction',
  'premise_transitions': ['shift', 'shift', 'shift', 'shift', 'shift', 'shift', ...],
  'hypothesis_transitions': ['shift', 'shift', 'shift', 'shift', 'shift', 'shift', ...],
}
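
The premise_transitions and hypothesis_transitions fields are shift/reduce sequences. One plausible way to obtain such a sequence from a fully bracketed binary parse is sketched below; both the helper name and the assumption that the transitions are derived this way are illustrative, and the parse string is made up for the example.

```python
def transitions_from_parse(binary_parse):
    """Map a bracketed binary parse to shift/reduce actions.

    Every word becomes a 'shift', every closing bracket a 'reduce',
    and opening brackets are ignored.
    """
    tokens = binary_parse.split()
    return ['reduce' if token == ')' else 'shift'
            for token in tokens if token != '(']


# Illustrative parse string, not taken from the dataset.
transitions = transitions_from_parse('( ( Kids are ) ( on ( a ride ) ) )')
```

For a sentence of n words, a binary parse yields n shifts and n - 1 reduces.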
torchnlp.datasets.simple_qa_dataset(directory='data/', train=False, dev=False, test=False, extracted_name='SimpleQuestions_v2', train_filename='annotated_fb_data_train.txt', dev_filename='annotated_fb_data_valid.txt', test_filename='annotated_fb_data_test.txt', check_files=['SimpleQuestions_v2/annotated_fb_data_train.txt'], url='https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz?raw=1')[source]

Load the SimpleQuestions dataset.

Single-relation factoid questions (simple questions) are common in many settings (e.g. Microsoft’s search query logs and WikiAnswers questions). The SimpleQuestions dataset is one of the most commonly used benchmarks for studying single-relation factoid questions.

Reference: https://research.fb.com/publications/large-scale-simple-question-answering-with-memory-networks/

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • extracted_name (str, optional) – Name of the extracted dataset directory.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • test_filename (str, optional) – The filename of the test split.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset tar.gz file.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import simple_qa_dataset  # doctest: +SKIP
>>> train = simple_qa_dataset(train=True)  # doctest: +SKIP
SimpleQuestions_v2.tgz:  15%|▏| 62.3M/423M [00:09<00:41, 8.76MB/s]
>>> train[0:2]  # doctest: +SKIP
[{
  'question': 'what is the book e about',
  'relation': 'www.freebase.com/book/written_work/subjects',
  'object': 'www.freebase.com/m/01cj3p',
  'subject': 'www.freebase.com/m/04whkz5'
}, {
  'question': 'to what release does the release track cardiac arrest come from',
  'relation': 'www.freebase.com/music/release_track/release',
  'object': 'www.freebase.com/m/0sjc7c1',
  'subject': 'www.freebase.com/m/0tp2p24'
}]
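
Each row above has four fields. A minimal parsing sketch follows, assuming the annotated files hold one tab-separated record per line in the order subject, relation, object, question; that column order and the helper parse_line are assumptions for illustration.

```python
def parse_line(line):
    """Split one tab-separated record into a row dictionary."""
    subject, relation, obj, question = line.rstrip('\n').split('\t')
    return {
        'question': question,
        'relation': relation,
        'object': obj,
        'subject': subject,
    }


line = ('www.freebase.com/m/04whkz5\t'
        'www.freebase.com/book/written_work/subjects\t'
        'www.freebase.com/m/01cj3p\t'
        'what is the book e about\n')
row = parse_line(line)
```
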
torchnlp.datasets.imdb_dataset(directory='data/', train=False, test=False, train_directory='train', test_directory='test', extracted_name='aclImdb', check_files=['aclImdb/README'], url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', sentiments=['pos', 'neg'])[source]

Load the IMDB dataset (Large Movie Review Dataset v1.0).

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is available as well. Both raw text and an already processed bag-of-words format are provided.

Note

The order in which examples are returned is not guaranteed, because the files are discovered with iglob.
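
If a stable order matters (for example, for reproducible slicing), one option is to sort before use. The sketch below does not call torchnlp; it builds a hypothetical directory with one file per review and shows that wrapping glob.iglob in sorted() gives a deterministic order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one text file per review under <root>/pos/.
    os.makedirs(os.path.join(root, 'pos'))
    for name in ['2_9.txt', '0_9.txt', '1_7.txt']:
        with open(os.path.join(root, 'pos', name), 'w') as handle:
            handle.write('review text')

    pattern = os.path.join(root, 'pos', '*.txt')
    # iglob yields paths in an arbitrary, OS-dependent order;
    # sorting the paths makes the order reproducible.
    ordered = [os.path.basename(path) for path in sorted(glob.iglob(pattern))]
```

Rows that are already loaded can be ordered the same way, e.g. sorted(rows, key=lambda row: row['text']).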

Reference: http://ai.stanford.edu/~amaas/data/sentiment/

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_directory (str, optional) – The directory of the training split.
  • test_directory (str, optional) – The directory of the test split.
  • extracted_name (str, optional) – Name of the extracted dataset directory.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset tar.gz file.
  • sentiments (list of str, optional) – Sentiments to load from the dataset.
Returns:

Returns between one and all dataset splits (train and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import imdb_dataset  # doctest: +SKIP
>>> train = imdb_dataset(train=True)  # doctest: +SKIP
>>> train[0:2]  # doctest: +SKIP
[{
  'text': 'For a movie that gets no respect there sure are a lot of memorable quotes...',
  'sentiment': 'pos'
}, {
  'text': 'Bizarre horror movie filled with famous faces but stolen by Cristina Raines...',
  'sentiment': 'pos'
}]
torchnlp.datasets.wikitext_2_dataset(directory='data/', train=False, dev=False, test=False, train_filename='wiki.train.tokens', dev_filename='wiki.valid.tokens', test_filename='wiki.test.tokens', extracted_name='wikitext-2', check_files=['wikitext-2/wiki.train.tokens'], url='https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip', unknown_token='<unk>', eos_token='</s>')[source]

Load the WikiText-2 dataset.

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Reference: https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • test_filename (str, optional) – The filename of the test split.
  • extracted_name (str, optional) – Name of the extracted dataset directory.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset zip file.
  • unknown_token (str, optional) – Token to use for unknown words.
  • eos_token (str, optional) – Token to use at the end of sentences.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import wikitext_2_dataset  # doctest: +SKIP
>>> train = wikitext_2_dataset(train=True)  # doctest: +SKIP
>>> train[:10]  # doctest: +SKIP
['</s>', '=', 'Valkyria', 'Chronicles', 'III', '=', '</s>', '</s>', 'Senjō', 'no']
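
The output above is a flat token list in which eos_token marks line ends, including empty lines. A minimal sketch of a tokenizer that produces this shape, assuming whitespace splitting with one eos_token appended per line; tokenize_lines is a hypothetical helper and the input lines are illustrative.

```python
def tokenize_lines(lines, eos_token='</s>'):
    """Flatten lines into one token list, appending eos_token after each line."""
    tokens = []
    for line in lines:
        tokens.extend(line.split())
        tokens.append(eos_token)
    return tokens


# Illustrative input: a blank line, a heading, a blank line, then body text.
lines = ['', ' = Valkyria Chronicles III = ', '', ' Senjō no']
tokens = tokenize_lines(lines)
```
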
torchnlp.datasets.penn_treebank_dataset(directory='data/penn-treebank', train=False, dev=False, test=False, train_filename='ptb.train.txt', dev_filename='ptb.valid.txt', test_filename='ptb.test.txt', check_files=['ptb.train.txt'], urls=['https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt', 'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt', 'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt'], unknown_token='<unk>', eos_token='</s>')[source]

Load the Penn Treebank dataset.

This is the Penn Treebank Project: Release 2 CDROM, featuring a million words of 1989 Wall Street Journal material.

Reference: https://catalog.ldc.upenn.edu/LDC99T42

Citation: Marcus, Mitchell P., Marcinkiewicz, Mary Ann & Santorini, Beatrice (1993). Building a Large Annotated Corpus of English: The Penn Treebank

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • test_filename (str, optional) – The filename of the test split.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • urls (list of str, optional) – URLs to download.
  • unknown_token (str, optional) – Token to use for unknown words.
  • eos_token (str, optional) – Token to use at the end of sentences.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import penn_treebank_dataset  # doctest: +SKIP
>>> train = penn_treebank_dataset(train=True)  # doctest: +SKIP
>>> train[:10]  # doctest: +SKIP
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano',
'guterman', 'hydro-quebec']
torchnlp.datasets.ud_pos_dataset(directory='data/', train=False, dev=False, test=False, train_filename='en-ud-tag.v2.train.txt', dev_filename='en-ud-tag.v2.dev.txt', test_filename='en-ud-tag.v2.test.txt', extracted_name='en-ud-v2', check_files=['en-ud-v2/en-ud-tag.v2.train.txt'], url='https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip')[source]

Load the Universal Dependencies - English Dependency Treebank dataset.

Corpus of sentences annotated using Universal Dependencies annotation. The corpus comprises 254,830 words and 16,622 sentences, taken from various web media including weblogs, newsgroups, emails, reviews, and Yahoo! answers.

References

Citation: Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer and Christopher D. Manning (2014). A Gold Standard Dependency Corpus for English.

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • test_filename (str, optional) – The filename of the test split.
  • extracted_name (str, optional) – Name of the extracted dataset directory.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset zip file.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import ud_pos_dataset  # doctest: +SKIP
>>> train = ud_pos_dataset(train=True)  # doctest: +SKIP
>>> train[17]  # doctest: +SKIP
{
  'tokens': ['Guerrillas', 'killed', 'an', 'engineer', ',', 'Asi', 'Ali', ',', 'from',
             'Tikrit', '.'],
  'ud_tags': ['NOUN', 'VERB', 'DET', 'NOUN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'ADP',
              'PROPN', 'PUNCT'],
  'ptb_tags': ['NNS', 'VBD', 'DT', 'NN', ',', 'NNP', 'NNP', ',', 'IN', 'NNP', '.']
}
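
Because the three lists in a row are aligned by position, they can be zipped into per-token tuples. A short sketch using a shortened copy of the row shown above:

```python
# Shortened copy of the example row; only the first four tokens are kept.
row = {
    'tokens': ['Guerrillas', 'killed', 'an', 'engineer'],
    'ud_tags': ['NOUN', 'VERB', 'DET', 'NOUN'],
    'ptb_tags': ['NNS', 'VBD', 'DT', 'NN'],
}

# One (token, UD tag, PTB tag) tuple per position.
tagged = list(zip(row['tokens'], row['ud_tags'], row['ptb_tags']))
```
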
torchnlp.datasets.trec_dataset(directory='data/trec/', train=False, test=False, train_filename='train_5500.label', test_filename='TREC_10.label', check_files=['train_5500.label'], urls=['http://cogcomp.org/Data/QA/QC/train_5500.label', 'http://cogcomp.org/Data/QA/QC/TREC_10.label'], fine_grained=False)[source]

Load the Text REtrieval Conference (TREC) Question Classification dataset.

The TREC dataset contains 5,500 labeled questions in the training set and another 500 in the test set. It has 6 coarse labels and 50 level-2 (fine-grained) labels. The average sentence length is 10 and the vocabulary size is about 8,700.

References

Citation: Xin Li, Dan Roth, Learning Question Classifiers. COLING’02, Aug., 2002.

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • test_filename (str, optional) – The filename of the test split.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • urls (list of str, optional) – URLs to download.
Returns:

Returns between one and all dataset splits (train and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import trec_dataset  # doctest: +SKIP
>>> train = trec_dataset(train=True)  # doctest: +SKIP
>>> train[:2]  # doctest: +SKIP
[{
  'label': 'DESC',
  'text': 'How did serfdom develop in and then leave Russia ?'
}, {
  'label': 'ENTY',
  'text': 'What films featured the character Popeye Doyle ?'
}]
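
The fine_grained flag chooses between the 6 coarse labels and the 50 level-2 labels. The sketch below assumes each line of the label file starts with a COARSE:fine tag followed by the question; that format, the example tag and the helper parse_trec_line are assumptions for illustration.

```python
def parse_trec_line(line, fine_grained=False):
    """Split a line into label and text, optionally keeping the fine label."""
    label, text = line.strip().split(' ', 1)
    if not fine_grained:
        # Keep only the coarse part before the colon.
        label = label.split(':')[0]
    return {'label': label, 'text': text}


# Assumed line format: 'COARSE:fine question text'.
line = 'DESC:manner How did serfdom develop in and then leave Russia ?'
coarse = parse_trec_line(line)
fine = parse_trec_line(line, fine_grained=True)
```
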
torchnlp.datasets.reverse_dataset(train=False, dev=False, test=False, train_rows=10000, dev_rows=1000, test_rows=1000, seq_max_length=10)[source]

Load the Reverse dataset.

The Reverse dataset is a simple task of reversing a list of numbers. This dataset is useful for testing implementations of sequence to sequence models.

Parameters:
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_rows (int, optional) – Number of training rows to generate.
  • dev_rows (int, optional) – Number of development rows to generate.
  • test_rows (int, optional) – Number of test rows to generate.
  • seq_max_length (int, optional) – Maximum sequence length.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.random import set_seed
>>> set_seed(321)
>>>
>>> from torchnlp.datasets import reverse_dataset
>>> train = reverse_dataset(train=True)
>>> train[0:1]
[{'source': '6 2 5 8 7', 'target': '7 8 5 2 6'}]
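
Rows like the one above are cheap to generate. The following is a sketch of one way to do it with the standard library; make_reverse_row is hypothetical, and the random draws are not expected to reproduce the library's output for any seed.

```python
import random


def make_reverse_row(rng, seq_max_length=10):
    """Draw a random digit sequence and pair it with its reversal."""
    length = rng.randint(1, seq_max_length)
    numbers = [str(rng.randint(0, 9)) for _ in range(length)]
    return {
        'source': ' '.join(numbers),
        'target': ' '.join(reversed(numbers)),
    }


rng = random.Random(321)  # local generator, so global random state is untouched
rows = [make_reverse_row(rng) for _ in range(3)]
```
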
torchnlp.datasets.count_dataset(train=False, dev=False, test=False, train_rows=10000, dev_rows=1000, test_rows=1000, seq_max_length=10)[source]

Load the Count dataset.

The Count dataset is a simple task of counting the number of integers in a sequence. This dataset is useful for testing implementations of sequence to label models.

Parameters:
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_rows (int, optional) – Number of training rows to generate.
  • dev_rows (int, optional) – Number of development rows to generate.
  • test_rows (int, optional) – Number of test rows to generate.
  • seq_max_length (int, optional) – Maximum sequence length.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.random import set_seed
>>> set_seed(321)
>>>
>>> from torchnlp.datasets import count_dataset
>>> train = count_dataset(train=True)
>>> train[0:2]
[{'numbers': '6 2 5 8 7', 'count': '5'}, {'numbers': '3 9 7 6 6 7', 'count': '6'}]
torchnlp.datasets.zero_dataset(train=False, dev=False, test=False, train_rows=256, dev_rows=64, test_rows=64)[source]

Load the Zero dataset.

The Zero dataset is a simple task of predicting zero from zero. This dataset is useful for integration testing: because the task is extremely simple, models learn it quickly, which makes for fast end-to-end tests.

Parameters:
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_rows (int, optional) – Number of training rows to generate.
  • dev_rows (int, optional) – Number of development rows to generate.
  • test_rows (int, optional) – Number of test rows to generate.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import zero_dataset
>>> train = zero_dataset(train=True)
>>> train[0:2]
[{'source': '0', 'target': '0'}, {'source': '0', 'target': '0'}]
torchnlp.datasets.smt_dataset(directory='data/', train=False, dev=False, test=False, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', extracted_name='trees', check_files=['trees/train.txt'], url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip', fine_grained=False, subtrees=False)[source]

Load the Stanford Sentiment Treebank dataset.

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality.

Reference: https://nlp.stanford.edu/sentiment/index.html

Citation: Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • test (bool, optional) – Whether to load the test split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • test_filename (str, optional) – The filename of the test split.
  • extracted_name (str, optional) – Name of the extracted dataset directory.
  • check_files (list of str, optional) – If these files exist, the download is considered successful.
  • url (str, optional) – URL of the dataset zip file.
  • subtrees (bool, optional) – Whether to include sentiment-tagged subphrases in addition to complete examples.
  • fine_grained (bool, optional) – Whether to use 5-class instead of 3-class labeling.
Returns:

Returns between one and all dataset splits (train, dev and test) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import smt_dataset  # doctest: +SKIP
>>> train = smt_dataset(train=True)  # doctest: +SKIP
>>> train[5]  # doctest: +SKIP
{
  'text': "Whether or not you 're enlightened by any of Derrida 's lectures on ...",
  'label': 'positive'
}
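
The difference between 3-class and 5-class labeling can be sketched as a mapping from treebank scores to label names. The 0 to 4 score range and the exact label strings below are assumptions made for illustration, and label_name is a hypothetical helper.

```python
def label_name(score, fine_grained=False):
    """Map an assumed 0-4 sentiment score to a 3-class or 5-class label."""
    prefix = 'very ' if fine_grained else ''
    return {
        0: prefix + 'negative',
        1: 'negative',
        2: 'neutral',
        3: 'positive',
        4: prefix + 'positive',
    }[score]


coarse_labels = {label_name(score) for score in range(5)}
fine_labels = {label_name(score, fine_grained=True) for score in range(5)}
```
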
torchnlp.datasets.squad_dataset(directory='data/', train=False, dev=False, train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', check_files_train=['train-v2.0.json'], check_files_dev=['dev-v2.0.json'], url_train='https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json', url_dev='https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json')[source]

Load the Stanford Question Answering Dataset (SQuAD).

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Reference: https://rajpurkar.github.io/SQuAD-explorer/

Citation: Rajpurkar, P., Jia, R. and Liang, P., 2018. Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Parameters:
  • directory (str, optional) – Directory to cache the dataset.
  • train (bool, optional) – Whether to load the training split of the dataset.
  • dev (bool, optional) – Whether to load the development split of the dataset.
  • train_filename (str, optional) – The filename of the training split.
  • dev_filename (str, optional) – The filename of the development split.
  • check_files_train (list, optional) – All train filenames
  • check_files_dev (list, optional) – All development filenames
  • url_train (str, optional) – URL of the train dataset .json file.
  • url_dev (str, optional) – URL of the dev dataset .json file.
Returns:

Returns between one and all dataset splits (train and dev) depending on if their respective boolean argument is True.

Return type:

tuple of iterable or iterable

Example

>>> from torchnlp.datasets import squad_dataset  # doctest: +SKIP
>>> train = squad_dataset(train=True)  # doctest: +SKIP
>>> train[0]['paragraphs'][0]['qas'][0]['question']  # doctest: +SKIP
'When did Beyonce start becoming popular?'
>>> train[0]['paragraphs'][0]['qas'][0]['answers'][0]  # doctest: +SKIP
{'text': 'in the late 1990s', 'answer_start': 269}
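
The nested indexing above (article, paragraphs, qas, answers) can be flattened into question/answer pairs. This sketch uses a tiny hand-made structure with the same keys; the second question is invented to show an unanswerable case, assumed here to be represented by an empty answers list, and iter_questions is a hypothetical helper.

```python
def iter_questions(articles):
    """Yield (question, list of answer texts) for every question."""
    for article in articles:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                answers = [answer['text'] for answer in qa['answers']]
                yield qa['question'], answers


# Hand-made stand-in with the same nesting as the loaded data.
articles = [{
    'paragraphs': [{
        'qas': [
            {'question': 'When did Beyonce start becoming popular?',
             'answers': [{'text': 'in the late 1990s', 'answer_start': 269}]},
            {'question': 'A made-up unanswerable question?', 'answers': []},
        ],
    }],
}]

pairs = list(iter_questions(articles))
unanswerable = [question for question, answers in pairs if not answers]
```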