torchnlp.encoders package

The torchnlp.encoders package supports encoding objects as a vector torch.Tensor and decoding a vector torch.Tensor back into objects.

class torchnlp.encoders.Encoder(enforce_reversible=False)[source]

Bases: object

Base class for an encoder employing an identity function.

Parameters:enforce_reversible (bool, optional) – Check for reversibility on Encoder.encode and Encoder.decode. Formally, reversible means: Encoder.decode(Encoder.encode(object_)) == object_.
batch_decode(iterator, *args, **kwargs)[source]
Parameters:
  • iterator (list) – Batch of encoded objects.
  • *args – Arguments passed to decode.
  • **kwargs – Keyword arguments passed to decode.
Returns:Batch of decoded objects.
Return type:list

batch_encode(iterator, *args, **kwargs)[source]
Parameters:
  • iterator (list) – Batch of objects to encode.
  • *args – Arguments passed to encode.
  • **kwargs – Keyword arguments passed to encode.
Returns:Batch of encoded objects.
Return type:list

decode(encoded)[source]

Decodes an object.

Parameters:encoded (object) – Encoded object.
Returns:Object decoded.
Return type:object
encode(object_)[source]

Encodes an object.

Parameters:object_ (object) – Object to encode.
Returns:Encoding of the object.
Return type:object
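
Because the base class implements an identity function, encode and decode should return their inputs unchanged; a minimal sketch:

Example

>>> encoder = Encoder()
>>> encoder.encode('label_a')
'label_a'
>>> encoder.decode(encoder.encode('label_a'))
'label_a'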
class torchnlp.encoders.LabelEncoder(sample, min_occurrences=1, reserved_labels=['<unk>'], unknown_index=0, **kwargs)[source]

Bases: torchnlp.encoders.encoder.Encoder

Encodes a label via a dictionary.

Parameters:
  • sample (list of strings) – Sample of data used to build encoding dictionary.
  • min_occurrences (int, optional) – Minimum number of occurrences for a label to be added to the encoding dictionary.
  • reserved_labels (list, optional) – List of reserved labels inserted in the beginning of the dictionary.
  • unknown_index (int, optional) – The unknown label is used to encode unseen labels. This is the index that label resides at.
  • **kwargs – Keyword arguments passed onto Encoder.

Example

>>> samples = ['label_a', 'label_b']
>>> encoder = LabelEncoder(samples, reserved_labels=['unknown'], unknown_index=0)
>>> encoder.encode('label_a')
tensor(1)
>>> encoder.decode(encoder.encode('label_a'))
'label_a'
>>> encoder.encode('label_c')
tensor(0)
>>> encoder.decode(encoder.encode('label_c'))
'unknown'
>>> encoder.vocab
['unknown', 'label_a', 'label_b']
batch_decode(tensor, *args, dim=0, **kwargs)[source]
Parameters:
  • tensor (torch.Tensor) – Batch of tensors.
  • *args – Arguments passed to Encoder.batch_decode.
  • dim (int, optional) – Dimension along which to split tensors.
  • **kwargs – Keyword arguments passed to Encoder.batch_decode.
Returns:Batch of decoded labels.
Return type:list

batch_encode(iterator, *args, dim=0, **kwargs)[source]
Parameters:
  • iterator (iterator) – Batch of labels to encode.
  • *args – Arguments passed to Encoder.batch_encode.
  • dim (int, optional) – Dimension along which to concatenate tensors.
  • **kwargs – Keyword arguments passed to Encoder.batch_encode.
Returns:Tensor of encoded labels.
Return type:torch.Tensor
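
Continuing the LabelEncoder example above, a sketch of batch encoding and decoding; the expected outputs assume the vocabulary built there:

Example

>>> encoder.batch_encode(['label_a', 'label_b'])
tensor([1, 2])
>>> encoder.batch_decode(encoder.batch_encode(['label_a', 'label_b']))
['label_a', 'label_b']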

decode(encoded)[source]

Decodes encoded label.

Parameters:encoded (torch.Tensor) – Encoded label.
Returns:Label decoded from encoded.
Return type:object
encode(label)[source]

Encodes a label.

Parameters:label (object) – Label to encode.
Returns:Encoding of the label.
Return type:torch.Tensor
vocab

List of labels in the dictionary.

Type:list
vocab_size

Number of labels in the dictionary.

Type:int
class torchnlp.encoders.text.CharacterEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes text into a tensor by splitting the text into individual characters.

Parameters:
  • *args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
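
A sketch of character-level encoding, assuming the default reserved tokens shown for DelimiterEncoder below and characters indexed in order of first appearance:

Example

>>> encoder = CharacterEncoder(['abc'])
>>> encoder.encode('abc')
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'a', 'b', 'c']
>>> encoder.decode(encoder.encode('abc'))
'abc'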
class torchnlp.encoders.text.DelimiterEncoder(delimiter, *args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes text into a tensor by splitting the text using a delimiter.

Parameters:
  • delimiter (string) – Delimiter used with string.split.
  • **args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

Example

>>> encoder = DelimiterEncoder('|', ['token_a|token_b', 'token_c'])
>>> encoder.encode('token_a|token_c')
tensor([5, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'token_a', 'token_b', 'token_c']
class torchnlp.encoders.text.MosesEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes the text using the Moses tokenizer.

Tokenizer Reference: http://www.nltk.org/_modules/nltk/tokenize/moses.html

Parameters:
  • **args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

NOTE: The doctest is skipped because running NLTK Moses with Python 3.7’s pytest halts on Travis.

Example

>>> encoder = MosesEncoder(["This ain't funny.", "Don't?"]) # doctest: +SKIP
>>> encoder.encode("This ain't funny.") # doctest: +SKIP
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab # doctest: +SKIP
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ain', '&apos;t', 'funny', '.', 'Don', '?']
>>> encoder.decode(encoder.encode("This ain't funny.")) # doctest: +SKIP
"This ain't funny."
torchnlp.encoders.text.pad_tensor(tensor, length, padding_index=0)[source]

Pad a tensor to length with padding_index.

Parameters:
  • tensor (torch.Tensor [n, ..]) – Tensor to pad.
  • length (int) – Pad the tensor up to length.
  • padding_index (int, optional) – Index to pad tensor with.
Returns:(torch.Tensor [length, …]) Padded Tensor.
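
A sketch of the expected behavior with the default padding_index:

Example

>>> import torch
>>> pad_tensor(torch.tensor([1, 2, 3]), 5)
tensor([1, 2, 3, 0, 0])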
torchnlp.encoders.text.stack_and_pad_tensors(batch, padding_index=0, dim=0)[source]

Pad a list of tensors (batch) with padding_index.

Parameters:
  • batch (list of torch.Tensor) – Batch of tensors to pad.
  • padding_index (int, optional) – Index to pad tensors with.
  • dim (int, optional) – Dimension along which to concatenate the batch of tensors.
Returns:BatchedSequences(torch.Tensor, torch.Tensor) – Padded tensors and original lengths of tensors.
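
A sketch of padding and stacking a ragged batch, assuming the defaults padding_index=0 and dim=0:

Example

>>> import torch
>>> tensor, lengths = stack_and_pad_tensors([torch.tensor([1, 2, 3]), torch.tensor([4, 5])])
>>> tensor
tensor([[1, 2, 3],
        [4, 5, 0]])
>>> lengths
tensor([3, 2])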
class torchnlp.encoders.text.TextEncoder(enforce_reversible=False)[source]

Bases: torchnlp.encoders.encoder.Encoder

batch_decode(tensor, lengths, dim=0, *args, **kwargs)[source]
Parameters:
  • tensor (torch.Tensor) – Batch of encoded sequences.
  • lengths (torch.Tensor) – Original lengths of sequences.
  • dim (int, optional) – Dimension along which to split tensors.
  • *args – Arguments passed to decode.
  • **kwargs – Key word arguments passed to decode.
Returns:Batch of decoded sequences.
Return type:list

batch_encode(iterator, *args, dim=0, **kwargs)[source]
Parameters:
  • iterator (iterator) – Batch of text to encode.
  • *args – Arguments passed onto Encoder.batch_encode.
  • dim (int, optional) – Dimension along which to concatenate tensors.
  • **kwargs – Keyword arguments passed onto Encoder.batch_encode.
Returns:torch.Tensor, torch.Tensor – Encoded and padded batch of sequences; original lengths of sequences.
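
TextEncoder is a base class; the sketch below uses the WhitespaceEncoder subclass documented later to illustrate the padded batch and original lengths it returns:

Example

>>> encoder = WhitespaceEncoder(['hello world', 'hi'])
>>> tensor, lengths = encoder.batch_encode(['hello world', 'hi'])
>>> tensor
tensor([[5, 6],
        [7, 0]])
>>> lengths
tensor([2, 1])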
decode(encoded)[source]

Decodes an object.

Parameters:encoded (object) – Encoded object.
Returns:Object decoded.
Return type:object
class torchnlp.encoders.text.SpacyEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes the text using spaCy’s tokenizer.

Tokenizer Reference: https://spacy.io/api/tokenizer

Parameters:
  • *args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • language (string, optional) – Language to use for parsing. Accepted values are ‘en’, ‘de’, ‘es’, ‘pt’, ‘fr’, ‘it’, ‘nl’ and ‘xx’. For details see https://spacy.io/models/#available-models
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

Example

>>> encoder = SpacyEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ai', "n't", 'funny', '.', 'Do', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ai n't funny ."
batch_encode(sequences)[source]
Parameters:sequences (iterator) – Batch of text to encode.
Returns:torch.Tensor, torch.Tensor – Encoded and padded batch of sequences; original lengths of sequences.
class torchnlp.encoders.text.StaticTokenizerEncoder(sample, min_occurrences=1, append_eos=False, tokenize=<function _tokenize>, detokenize=<function _detokenize>, reserved_tokens=['<pad>', '<unk>', '</s>', '<s>', '<copy>'], eos_index=2, unknown_index=1, padding_index=0, **kwargs)[source]

Bases: torchnlp.encoders.text.text_encoder.TextEncoder

Encodes a text sequence using a static tokenizer.

Parameters:
  • sample (collections.abc.Iterable) – Sample of data used to build encoding dictionary.
  • min_occurrences (int, optional) – Minimum number of occurrences for a token to be added to the encoding dictionary.
  • tokenize (callable) – callable to tokenize a sequence.
  • detokenize (callable) – callable to detokenize a sequence.
  • append_eos (bool, optional) – If True, append an EOS token to the end of the encoded vector.
  • reserved_tokens (list of str, optional) – List of reserved tokens inserted in the beginning of the dictionary.
  • eos_index (int, optional) – The eos token is used to encode the end of a sequence. This is the index that token resides at.
  • unknown_index (int, optional) – The unknown token is used to encode unseen tokens. This is the index that token resides at.
  • padding_index (int, optional) – The padding token is used to encode sequence padding. This is the index that token resides at.
  • **kwargs – Keyword arguments passed onto TextEncoder.__init__.

Example

>>> sample = ["This ain't funny.", "Don't?"]
>>> encoder = StaticTokenizerEncoder(sample, tokenize=lambda s: s.split())
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.', "Don't?"]
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
decode(encoded)[source]

Decodes a tensor into a sequence.

Parameters:encoded (torch.Tensor) – Encoded sequence.
Returns:Sequence decoded from encoded.
Return type:str
encode(sequence)[source]

Encodes a sequence.

Parameters:sequence (str) – String sequence to encode.
Returns:Encoding of the sequence.
Return type:torch.Tensor
vocab

List of tokens in the dictionary.

Type:list
vocab_size

Number of tokens in the dictionary.

Type:int
class torchnlp.encoders.text.SubwordEncoder(sample, append_eos=False, target_vocab_size=None, min_occurrences=1, max_occurrences=1000.0, reserved_tokens=['<pad>', '<unk>', '</s>', '<s>', '<copy>'], eos_index=2, unknown_index=1, padding_index=0, **kwargs)[source]

Bases: torchnlp.encoders.text.text_encoder.TextEncoder

Invertibly encodes text using a limited vocabulary.

Applies Google’s Tensor2Tensor SubwordTextTokenizer, which invertibly encodes a native string as a sequence of subtokens from a limited vocabulary. To build the vocabulary, it uses a recursive binary search to find a minimum token count x (s.t. min_occurrences <= x <= max_occurrences) that most closely matches target_vocab_size.

Tokenizer Reference: https://github.com/tensorflow/tensor2tensor/blob/8bdecbe434d93cb1e79c0489df20fee2d5a37dc2/tensor2tensor/data_generators/text_encoder.py#L389

Parameters:
  • sample (list) – Sample of data used to build encoding dictionary.
  • append_eos (bool, optional) – If True, append an EOS token to the end of the encoded vector.
  • target_vocab_size (int, optional) – Desired size of vocab.
  • min_occurrences (int, optional) – Lower bound for the minimum token count.
  • max_occurrences (int, optional) – Upper bound for the minimum token count.
  • reserved_tokens (list of str, optional) – List of reserved tokens inserted in the beginning of the dictionary.
  • eos_index (int, optional) – The eos token is used to encode the end of a sequence. This is the index that token resides at.
  • unknown_index (int, optional) – The unknown token is used to encode unseen tokens. This is the index that token resides at.
  • padding_index (int, optional) – The padding token is used to encode sequence padding. This is the index that token resides at.
  • **kwargs – Keyword arguments passed onto TextEncoder.__init__.
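
Because the subtoken splits depend on the sample and the vocabulary search, a sketch here shows only the invertibility guarantee rather than exact indices:

Example

>>> encoder = SubwordEncoder(["This ain't funny.", "Don't?"])
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."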
decode(encoded)[source]

Decodes a tensor into a sequence.

Parameters:encoded (torch.Tensor) – Encoded sequence.
Returns:Sequence decoded from encoded.
Return type:str
encode(sequence)[source]

Encodes a sequence.

Parameters:sequence (str) – String sequence to encode.
Returns:Encoding of the sequence.
Return type:torch.Tensor
vocab

List of tokens in the dictionary.

Type:list
vocab_size

Number of tokens in the dictionary.

Type:int
class torchnlp.encoders.text.TreebankEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder

Encodes text using the Treebank tokenizer.

Tokenizer Reference: http://www.nltk.org/_modules/nltk/tokenize/treebank.html

Parameters:
  • *args – Arguments passed onto StaticTokenizerEncoder.__init__.
  • **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.

Example

>>> encoder = TreebankEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ai', "n't", 'funny', '.', 'Do', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
class torchnlp.encoders.text.WhitespaceEncoder(*args, **kwargs)[source]

Bases: torchnlp.encoders.text.delimiter_encoder.DelimiterEncoder

Encodes text by splitting on whitespace.

Parameters:
  • *args – Arguments passed onto DelimiterEncoder.__init__.
  • **kwargs – Keyword arguments passed onto DelimiterEncoder.__init__.

Example

>>> encoder = WhitespaceEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.', "Don't?"]
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
class torchnlp.encoders.text.BatchedSequences(tensor, lengths)

Bases: tuple

lengths

Alias for field number 1

tensor

Alias for field number 0
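
BatchedSequences is the namedtuple returned by stack_and_pad_tensors; a sketch of field access, under the same assumptions as the stack_and_pad_tensors example above:

Example

>>> import torch
>>> batch = stack_and_pad_tensors([torch.tensor([1, 2]), torch.tensor([3])])
>>> batch.tensor
tensor([[1, 2],
        [3, 0]])
>>> batch.lengths
tensor([2, 1])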