torchnlp.word_to_vector package

The torchnlp.word_to_vector package provides multiple sets of pretrained word vectors and handles downloading, caching, loading, and lookup.
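
As a quick orientation, the sketch below shows the lookup pattern shared by all classes in this package: construct a vectors object (which downloads and caches the data on first use), then index it with tokens. The choice of GloVe and the construction of a torch.nn.Embedding layer from the looked-up vectors are illustrative assumptions, not part of the package API.

>>> import torch  # doctest: +SKIP
>>> from torchnlp.word_to_vector import GloVe  # doctest: +SKIP
>>> vectors = GloVe()  # downloads and caches the vectors on first use  # doctest: +SKIP
>>> tokens = ['hello', 'world']
>>> weights = torch.stack([vectors[token] for token in tokens])  # doctest: +SKIP
>>> embedding = torch.nn.Embedding.from_pretrained(weights)  # doctest: +SKIP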

class torchnlp.word_to_vector.GloVe(name='840B', dim=300, **kwargs)[source]

Word vectors from Stanford, derived from word-word co-occurrence statistics over a corpus.

GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.

Reference: https://nlp.stanford.edu/projects/glove/

Parameters:
  • name (str) – name of the GloVe vectors (‘840B’, ‘twitter.27B’, ‘6B’, ‘42B’)
  • cache (str, optional) – directory for cached vectors
  • unk_init (callable, optional) – function used to initialize out-of-vocabulary word vectors; by default, they are initialized to zero vectors. Can be any function that takes in a Tensor and returns a Tensor of the same size.
  • is_include (callable, optional) – callable that returns True if a token should be included in the in-memory vectors cache; some of these embedding files are gigantic, so filtering can cut down on memory usage (see the sketch after the example below). Vectors are not cached on disk if is_include is defined.

Example

>>> from torchnlp.word_to_vector import GloVe  # doctest: +SKIP
>>> vectors = GloVe()  # doctest: +SKIP
>>> vectors['hello']  # doctest: +SKIP
-1.7494
0.6242
...
-0.6202
2.0928
[torch.FloatTensor of size 300]
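
The sketch below illustrates the is_include and unk_init parameters described above: only tokens from a small application vocabulary are kept in memory, and out-of-vocabulary tokens are initialized randomly. The toy vocabulary, the choice of the smaller '6B' vectors, and the use of torch.Tensor.normal_ are illustrative assumptions.

>>> import torch  # doctest: +SKIP
>>> from torchnlp.word_to_vector import GloVe  # doctest: +SKIP
>>> vocab = {'hello', 'world'}  # hypothetical application vocabulary
>>> vectors = GloVe(name='6B', dim=100,
...                 is_include=lambda token: token in vocab,
...                 unk_init=torch.Tensor.normal_)  # doctest: +SKIP
>>> in_vocab = vectors['hello']       # kept in the in-memory cache  # doctest: +SKIP
>>> out_of_vocab = vectors['oov-token']  # not cached; initialized by unk_init  # doctest: +SKIP
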
class torchnlp.word_to_vector.BPEmb(language='en', dim=300, merge_ops=50000, **kwargs)[source]

Byte-Pair Encoding (BPE) embeddings trained on Wikipedia for 275 languages

A collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as a testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization.

Parameters:
  • language (str, optional) – Language of the corpus on which the embeddings have been trained
  • dim (int, optional) – Dimensionality of the embeddings
  • merge_ops (int, optional) – Number of merge operations used by the tokenizer (see the sketch after the example below)

Example

>>> from torchnlp.word_to_vector import BPEmb  # doctest: +SKIP
>>> vectors = BPEmb(dim=25)  # doctest: +SKIP
>>> subwords = "▁mel ford shire".split()  # doctest: +SKIP
>>> vectors[subwords]  # doctest: +SKIP
Columns 0 to 9
-0.5859 -0.1803  0.2623 -0.6052  0.0194 -0.2795  0.2716 -0.2957 -0.0492  1.0934
 0.3848 -0.2412  1.0599 -0.8588 -1.2596 -0.2534 -0.5704  0.2168 -0.1718  1.2675
 1.4407 -0.0996  1.2239 -0.5085 -0.7542 -0.9628 -1.7177  0.0618 -0.4025  1.0405
...
Columns 20 to 24
-0.0022  0.4820 -0.5156 -0.0564  0.4300
 0.0355 -0.2257  0.1323  0.6053 -0.8878
-0.0167 -0.3686  0.9666  0.2497 -1.2239
[torch.FloatTensor of size 3x25]
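
To further illustrate the language, dim and merge_ops parameters, the sketch below loads a smaller German model. The specific combination of language='de', 50 dimensions and 10,000 merge operations is assumed to be among the published BPEmb models; fewer merge operations yield a smaller, more aggressively segmented subword vocabulary.

>>> from torchnlp.word_to_vector import BPEmb  # doctest: +SKIP
>>> vectors = BPEmb(language='de', dim=50, merge_ops=10000)  # doctest: +SKIP
>>> vectors['▁berlin']  # look up a single subword unit  # doctest: +SKIP
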
class torchnlp.word_to_vector.FastText(language='en', aligned=False, **kwargs)[source]

Word vectors enriched with subword information, from Facebook’s AI Research (FAIR) lab.

An approach based on the skipgram model, in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations.

Parameters:
  • language (str) – language of the vectors
  • aligned (bool) – if True, use multilingual embeddings in which words with the same meaning occupy (approximately) the same position in the vector space across languages (see the sketch after the example below); if False, use the regular FastText embeddings. All available languages are listed at https://github.com/facebookresearch/MUSE#multilingual-word-embeddings
  • cache (str, optional) – directory for cached vectors
  • unk_init (callable, optional) – function used to initialize out-of-vocabulary word vectors; by default, they are initialized to zero vectors. Can be any function that takes in a Tensor and returns a Tensor of the same size.
  • is_include (callable, optional) – callable that returns True if a token should be included in the in-memory vectors cache; some of these embedding files are gigantic, so filtering can cut down on memory usage. Vectors are not cached on disk if is_include is defined.

Example

>>> from torchnlp.word_to_vector import FastText  # doctest: +SKIP
>>> vectors = FastText()  # doctest: +SKIP
>>> vectors['hello']  # doctest: +SKIP
-0.1595
-0.1826
...
0.2492
0.0654
[torch.FloatTensor of size 300]
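
The sketch below illustrates the aligned embeddings described above. Comparing an English word with its French translation via cosine similarity is an illustrative assumption about how the shared vector space might be used; it is not part of the package API.

>>> import torch  # doctest: +SKIP
>>> from torchnlp.word_to_vector import FastText  # doctest: +SKIP
>>> en_vectors = FastText(language='en', aligned=True)  # doctest: +SKIP
>>> fr_vectors = FastText(language='fr', aligned=True)  # doctest: +SKIP
>>> # Aligned spaces place translations near each other, so the cosine
>>> # similarity between 'cat' and 'chat' should be relatively high.
>>> torch.nn.functional.cosine_similarity(en_vectors['cat'], fr_vectors['chat'], dim=0)  # doctest: +SKIP
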
class torchnlp.word_to_vector.CharNGram(**kwargs)[source]

Character n-gram is a character-based compositional model for embedding textual sequences.

The character n-gram embeddings are trained with the same skip-gram objective as the word embeddings. The final character-based embedding of a word w_t is the average of the embeddings of its unique character n-grams. For example, the character n-grams (n = 1, 2, 3) of the word “Cat” are {C, a, t, #B#C, Ca, at, t#E#, #B#Ca, Cat, at#E#}, where “#B#” and “#E#” represent the beginning and the end of each word, respectively. Using character embeddings efficiently provides morphological features. Each word is subsequently represented as x_t, the concatenation of its corresponding word and character embeddings, shared across the tasks.

Reference: http://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/

Parameters:
  • cache (str, optional) – directory for cached vectors
  • unk_init (callable, optional) – function used to initialize out-of-vocabulary word vectors; by default, they are initialized to zero vectors. Can be any function that takes in a Tensor and returns a Tensor of the same size.
  • is_include (callable, optional) – callable that returns True if a token should be included in the in-memory vectors cache; some of these embedding files are gigantic, so filtering can cut down on memory usage. Vectors are not cached on disk if is_include is defined.

Example

>>> from torchnlp.word_to_vector import CharNGram  # doctest: +SKIP
>>> vectors = CharNGram()  # doctest: +SKIP
>>> vectors['hello']  # doctest: +SKIP
-1.7494
0.6242
...
-0.6202
2.0928
[torch.FloatTensor of size 100]
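
As an illustration of the compositional nature of these embeddings, the sketch below looks up a misspelled token. Because vectors are averaged over character n-grams, such a token still receives a meaningful embedding composed from the n-grams it shares with the correctly spelled word; the specific tokens are illustrative.

>>> from torchnlp.word_to_vector import CharNGram  # doctest: +SKIP
>>> vectors = CharNGram()  # doctest: +SKIP
>>> known = vectors['hello']      # composed from the character n-grams of 'hello'  # doctest: +SKIP
>>> misspelled = vectors['helo']  # shares many of those n-grams with 'hello'  # doctest: +SKIP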