torchnlp.encoders package

The torchnlp.encoders package supports encoding objects as a vector torch.Tensor and decoding a vector torch.Tensor back into objects.

class torchnlp.encoders.Encoder(enforce_reversible=False)[source]

Bases: object

Base class for an encoder employing an identity function.

Parameters:enforce_reversible (bool, optional) – Check for reversibility on Encoder.encode and Encoder.decode. Formally, reversible means: Encoder.decode(Encoder.encode(object_)) == object_.
batch_decode(iterator, *args, **kwargs)[source]
Parameters:
  • iterator (list) – Batch of encoded objects.
  • *args – Arguments passed to decode.
  • **kwargs – Keyword arguments passed to decode.
Returns:Batch of decoded objects.
Return type:list

batch_encode(iterator, *args, **kwargs)[source]
Parameters:
  • iterator (list) – Batch of objects to encode.
  • *args – Arguments passed to encode.
  • **kwargs – Keyword arguments passed to encode.
Returns:Batch of encoded objects.
Return type:list

decode(encoded)[source]

Decodes an object.

Parameters:encoded (object) – Encoded object.
Returns:Object decoded.
Return type:object
encode(object_)[source]

Encodes an object.

Parameters:object_ (object) – Object to encode.
Returns:Encoding of the object.
Return type:object
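
Because the base class implements an identity function, encode and decode should return their inputs unchanged; a minimal sketch:

Example

>>> encoder = Encoder()
>>> encoder.encode('label_a')
'label_a'
>>> encoder.decode(encoder.encode('label_a'))
'label_a'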
class torchnlp.encoders.LabelEncoder(sample, min_occurrences=1, reserved_labels=['<unk>'], unknown_index=0, **kwargs)[source]

Bases: torchnlp.encoders.encoder.Encoder

Encodes a label via a dictionary.

Parameters:
  • sample (list of strings) – Sample of data used to build encoding dictionary.
  • min_occurrences (int, optional) – Minimum number of occurrences for a label to be added to the encoding dictionary.
  • reserved_labels (list, optional) – List of reserved labels inserted in the beginning of the dictionary.
  • unknown_index (int, optional) – The unknown label is used to encode unseen labels. This is the index that label resides at.
  • **kwargs – Keyword arguments passed onto Encoder.

Example

>>> samples = ['label_a', 'label_b']
>>> encoder = LabelEncoder(samples, reserved_labels=['unknown'], unknown_index=0)
>>> encoder.encode('label_a')
tensor(1)
>>> encoder.decode(encoder.encode('label_a'))
'label_a'
>>> encoder.encode('label_c')
tensor(0)
>>> encoder.decode(encoder.encode('label_c'))
'unknown'
>>> encoder.vocab
['unknown', 'label_a', 'label_b']
batch_decode(tensor, *args, dim=0, **kwargs)[source]
Parameters:
  • tensor (torch.Tensor) – Batch of tensors.
  • *args – Arguments passed to Encoder.batch_decode.
  • dim (int, optional) – Dimension along which to split tensors.
  • **kwargs – Keyword arguments passed to Encoder.batch_decode.
Returns:Batch of decoded labels.
Return type:list

batch_encode(iterator, *args, dim=0, **kwargs)[source]
Parameters:
  • iterator (iterator) – Batch of labels to encode.
  • *args – Arguments passed to Encoder.batch_encode.
  • dim (int, optional) – Dimension along which to concatenate tensors.
  • **kwargs – Keyword arguments passed to Encoder.batch_encode.
Returns:Tensor of encoded labels.
Return type:torch.Tensor
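
Continuing the LabelEncoder example above, a sketch of batch encoding and decoding; the expected outputs assume the vocabulary built there:

Example

>>> encoder.batch_encode(['label_a', 'label_b'])
tensor([1, 2])
>>> encoder.batch_decode(encoder.batch_encode(['label_a', 'label_b']))
['label_a', 'label_b']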

decode(encoded)[source]

Decodes encoded label.

Parameters:encoded (torch.Tensor) – Encoded label.
Returns:Label decoded from encoded.
Return type:object
encode(label)[source]

Encodes a label.

Parameters:label (object) – Label to encode.
Returns:Encoding of the label.
Return type:torch.Tensor
vocab

List of labels in the dictionary.

Type:list
vocab_size

Number of labels in the dictionary.

Type:int
class torchnlp.encoders.text.CharacterEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes text into a tensor by splitting the text into individual characters.

Parameters:
  • *args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
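
A sketch of character-level encoding, assuming the default reserved tokens shown for DelimiterEncoder below and characters indexed in order of first appearance:

Example

>>> encoder = CharacterEncoder(['abc'])
>>> encoder.encode('abc')
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'a', 'b', 'c']
>>> encoder.decode(encoder.encode('abc'))
'abc'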
class torchnlp.encoders.text.DelimiterEncoder(delimiter, *args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes text into a tensor by splitting the text using a delimiter.

Parameters:
  • delimiter (string) – Delimiter used with string.split.
  • **args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

Example

>>> encoder = DelimiterEncoder('|', ['token_a|token_b', 'token_c'])
>>> encoder.encode('token_a|token_c')
tensor([5, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'token_a', 'token_b', 'token_c']
class torchnlp.encoders.text.MosesEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes the text using the Moses tokenizer.

Tokenizer Reference: http://www.nltk.org/_modules/nltk/tokenize/moses.html

Parameters:
  • **args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

NOTE: The doctest is skipped because running NLTK Moses with Python 3.7’s pytest halts on Travis.

Example

>>> encoder = MosesEncoder(["This ain't funny.", "Don't?"]) # doctest: +SKIP
>>> encoder.encode("This ain't funny.") # doctest: +SKIP
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab # doctest: +SKIP
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ain', '&apos;t', 'funny', '.', 'Don', '?']
>>> encoder.decode(encoder.encode("This ain't funny.")) # doctest: +SKIP
"This ain't funny."
torchnlp.encoders.text.pad_tensor(tensor, length, padding_index=0)[source]

Pad a tensor to length with padding_index.

Parameters:
  • tensor (torch.Tensor [n, ..]) – Tensor to pad.
  • length (int) – Pad the tensor up to length.
  • padding_index (int, optional) – Index to pad tensor with.
Returns:(torch.Tensor [length, …]) Padded Tensor.
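
A sketch of the expected behavior with the default padding_index:

Example

>>> import torch
>>> pad_tensor(torch.tensor([1, 2, 3]), 5)
tensor([1, 2, 3, 0, 0])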
torchnlp.encoders.text.stack_and_pad_tensors(batch, padding_index=0, dim=0)[source]

Pad a list of tensors (batch) with padding_index.

Parameters:
  • batch (list of torch.Tensor) – Batch of tensors to pad.
  • padding_index (int, optional) – Index to pad tensors with.
  • dim (int, optional) – Dimension along which to concatenate the batch of tensors.
Returns:BatchedSequences(torch.Tensor, torch.Tensor) – Padded tensors and original lengths of tensors.
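
A sketch of padding and stacking a ragged batch, assuming the defaults padding_index=0 and dim=0:

Example

>>> import torch
>>> tensor, lengths = stack_and_pad_tensors([torch.tensor([1, 2, 3]), torch.tensor([4, 5])])
>>> tensor
tensor([[1, 2, 3],
        [4, 5, 0]])
>>> lengths
tensor([3, 2])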
class torchnlp.encoders.text.TextEncoder(enforce_reversible=False)[source]

Bases: torchnlp.encoders.encoder.Encoder

batch_decode(tensor, lengths, dim=0, *args, **kwargs)[source]
Parameters:
  • tensor (torch.Tensor) – Batch of encoded sequences.
  • lengths (torch.Tensor) – Original lengths of sequences.
  • dim (int, optional) – Dimension along which to split tensors.
  • *args – Arguments passed to decode.
  • **kwargs – Key word arguments passed to decode.
Returns:Batch of decoded sequences.
Return type:list

batch_encode(iterator, *args, dim=0, **kwargs)[source]
Parameters:
  • iterator (iterator) – Batch of text to encode.
  • *args – Arguments passed onto Encoder.batch_encode.
  • dim (int, optional) – Dimension along which to concatenate tensors.
  • **kwargs – Keyword arguments passed onto Encoder.batch_encode.
Returns:torch.Tensor, torch.Tensor – Encoded and padded batch of sequences; original lengths of sequences.
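
TextEncoder is a base class; the sketch below uses the WhitespaceEncoder subclass documented later to illustrate the padded batch and original lengths it returns:

Example

>>> encoder = WhitespaceEncoder(['hello world', 'hi'])
>>> tensor, lengths = encoder.batch_encode(['hello world', 'hi'])
>>> tensor
tensor([[5, 6],
        [7, 0]])
>>> lengths
tensor([2, 1])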
decode(encoded)[source]

Decodes an object.

Parameters:encoded (object) – Encoded object.
Returns:Object decoded.
Return type:object
class torchnlp.encoders.text.SpacyEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes the text using spaCy’s tokenizer.

Tokenizer Reference: https://spacy.io/api/tokenizer

Parameters:
  • *args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • language (string, optional) – Language to use for parsing. Accepted values are ‘en’, ‘de’, ‘es’, ‘pt’, ‘fr’, ‘it’, ‘nl’ and ‘xx’. For details see https://spacy.io/models/#available-models
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

Example

>>> encoder = SpacyEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ai', "n't", 'funny', '.', 'Do', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ai n't funny ."
batch_encode(sequences)[source]
Parameters:sequences (iterator) – Batch of text to encode.
Returns:torch.Tensor, torch.Tensor – Encoded and padded batch of sequences; original lengths of sequences.
class torchnlp.encoders.text.StaticTokenizerEncoder(sample, min_occurrences=1, append_eos=False, tokenize=<function _tokenize>, detokenize=<function _detokenize>, reserved_tokens=['<pad>', '<unk>', '</s>', '<s>', '<copy>'], eos_index=2, unknown_index=1, padding_index=0, **kwargs)[source]

Bases: torchnlp.encoders.text.text_encoder.TextEncoder

Encodes a text sequence using a static tokenizer.

Parameters:
  • sample (collections.abc.Iterable) – Sample of data used to build encoding dictionary.
  • min_occurrences (int, optional) – Minimum number of occurrences for a token to be added to the encoding dictionary.
  • tokenize (callable) – callable to tokenize a sequence.
  • detokenize (callable) – callable to detokenize a sequence.
  • append_eos (bool, optional) – If True, append an EOS token to the end of the encoded vector.
  • reserved_tokens (list of str, optional) – List of reserved tokens inserted in the beginning of the dictionary.
  • eos_index (int, optional) – The eos token is used to encode the end of a sequence. This is the index that token resides at.
  • unknown_index (int, optional) – The unknown token is used to encode unseen tokens. This is the index that token resides at.
  • padding_index (int, optional) – The padding token is used to encode sequence padding. This is the index that token resides at.
  • **kwargs – Keyword arguments passed onto TextEncoder.__init__.

Example

>>> sample = ["This ain't funny.", "Don't?"]
>>> encoder = StaticTokenizerEncoder(sample, tokenize=lambda s: s.split())
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.', "Don't?"]
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
decode(encoded)[source]

Decodes a tensor into a sequence.

Parameters:encoded (torch.Tensor) – Encoded sequence.
Returns:Sequence decoded from encoded.
Return type:str
encode(sequence)[source]

Encodes a sequence.

Parameters:sequence (str) – String sequence to encode.
Returns:Encoding of the sequence.
Return type:torch.Tensor
vocab

List of tokens in the dictionary.

Type:list
vocab_size

Number of tokens in the dictionary.

Type:int
class torchnlp.encoders.text.SubwordEncoder(sample, append_eos=False, target_vocab_size=None, min_occurrences=1, max_occurrences=1000.0, reserved_tokens=['<pad>', '<unk>', '</s>', '<s>', '<copy>'], eos_index=2, unknown_index=1, padding_index=0, **kwargs)[source]

Bases: torchnlp.encoders.text.text_encoder.TextEncoder

Invertibly encodes text using a limited vocabulary.

Applies Google’s Tensor2Tensor SubwordTextTokenizer, which invertibly encodes a native string as a sequence of subtokens from a limited vocabulary. To build the vocabulary, it uses a recursive binary search to find a minimum token count x (s.t. min_occurrences <= x <= max_occurrences) that most closely matches target_vocab_size.

Tokenizer Reference: https://github.com/tensorflow/tensor2tensor/blob/8bdecbe434d93cb1e79c0489df20fee2d5a37dc2/tensor2tensor/data_generators/text_encoder.py#L389

Parameters:
  • sample (list) – Sample of data used to build encoding dictionary.
  • append_eos (bool, optional) – If True, append an EOS token to the end of the encoded vector.
  • target_vocab_size (int, optional) – Desired size of vocab.
  • min_occurrences (int, optional) – Lower bound for the minimum token count.
  • max_occurrences (int, optional) – Upper bound for the minimum token count.
  • reserved_tokens (list of str, optional) – List of reserved tokens inserted in the beginning of the dictionary.
  • eos_index (int, optional) – The eos token is used to encode the end of a sequence. This is the index that token resides at.
  • unknown_index (int, optional) – The unknown token is used to encode unseen tokens. This is the index that token resides at.
  • padding_index (int, optional) – The padding token is used to encode sequence padding. This is the index that token resides at.
  • **kwargs – Keyword arguments passed onto TextEncoder.__init__.
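
Because the subtoken splits depend on the sample and the vocabulary search, a sketch here shows only the invertibility guarantee rather than exact indices:

Example

>>> encoder = SubwordEncoder(["This ain't funny.", "Don't?"])
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."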
decode(encoded)[source]

Decodes a tensor into a sequence.

Parameters:encoded (torch.Tensor) – Encoded sequence.
Returns:Sequence decoded from encoded.
Return type:str
encode(sequence)[source]

Encodes a sequence.

Parameters:sequence (str) – String sequence to encode.
Returns:Encoding of the sequence.
Return type:torch.Tensor
vocab

List of tokens in the dictionary.

Type:list
vocab_size

Number of tokens in the dictionary.

Type:int
class torchnlp.encoders.text.TreebankEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes text using the Treebank tokenizer.

Tokenizer Reference: http://www.nltk.org/_modules/nltk/tokenize/treebank.html

Parameters:
  • *args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

Example

>>> encoder = TreebankEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ai', "n't", 'funny', '.', 'Do', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
class torchnlp.encoders.text.WhitespaceEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.delimiter_encoder.DelimiterEncoder

Encodes text by splitting on whitespace.

Parameters:
  • *args – Arguments passed onto DelimiterEncoder.__init__.
  • **kwargs – Keyword arguments passed onto DelimiterEncoder.__init__.

Example

>>> encoder = WhitespaceEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.', "Don't?"]
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
class torchnlp.encoders.text.BatchedSequences(tensor, lengths)

Bases: tuple

lengths

Alias for field number 1

tensor

Alias for field number 0
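
BatchedSequences is the namedtuple returned by stack_and_pad_tensors; a sketch of field access, under the same assumptions as the stack_and_pad_tensors example above:

Example

>>> import torch
>>> batch = stack_and_pad_tensors([torch.tensor([1, 2]), torch.tensor([3])])
>>> batch.tensor
tensor([[1, 2],
        [3, 0]])
>>> batch.lengths
tensor([2, 1])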