torchnlp.encoders package¶
The torchnlp.encoders package supports encoding objects as a vector torch.Tensor and decoding a vector torch.Tensor back into the original objects.
class torchnlp.encoders.Encoder(enforce_reversible=False)[source]¶
Bases: object
Base class for an encoder employing an identity function.
Parameters: enforce_reversible (bool, optional) – Check for reversibility on Encoder.encode and Encoder.decode. Formally, reversible means: Encoder.decode(Encoder.encode(object_)) == object_.
batch_decode(iterator, *args, **kwargs)[source]¶
Parameters:
- iterator (list) – Batch of encoded objects.
- *args – Arguments passed to decode.
- **kwargs – Keyword arguments passed to decode.
Returns: Batch of decoded objects.
Return type: list
batch_encode(iterator, *args, **kwargs)[source]¶
Parameters:
- iterator (list) – Batch of objects to encode.
- *args – Arguments passed to encode.
- **kwargs – Keyword arguments passed to encode.
Returns: Batch of encoded objects.
Return type: list
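Since the base class applies an identity function, a round trip is a no-op; a minimal sketch of that contract:
>>> from torchnlp.encoders import Encoder
>>> encoder = Encoder(enforce_reversible=True)
>>> encoder.encode('label_a')
'label_a'
>>> encoder.batch_encode(['label_a', 'label_b'])
['label_a', 'label_b']
>>> encoder.decode('label_a')
'label_a'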
class torchnlp.encoders.LabelEncoder(sample, min_occurrences=1, reserved_labels=['<unk>'], unknown_index=0, **kwargs)[source]¶
Bases: torchnlp.encoders.encoder.Encoder
Encodes a label via a dictionary.
Parameters:
- sample (list of strings) – Sample of data used to build the encoding dictionary.
- min_occurrences (int, optional) – Minimum number of occurrences for a label to be added to the encoding dictionary.
- reserved_labels (list, optional) – List of reserved labels inserted at the beginning of the dictionary.
- unknown_index (int, optional) – The unknown label is used to encode unseen labels. This is the index that label resides at.
- **kwargs – Keyword arguments passed onto Encoder.
Example
>>> samples = ['label_a', 'label_b']
>>> encoder = LabelEncoder(samples, reserved_labels=['unknown'], unknown_index=0)
>>> encoder.encode('label_a')
tensor(1)
>>> encoder.decode(encoder.encode('label_a'))
'label_a'
>>> encoder.encode('label_c')
tensor(0)
>>> encoder.decode(encoder.encode('label_c'))
'unknown'
>>> encoder.vocab
['unknown', 'label_a', 'label_b']
batch_decode(tensor, *args, dim=0, **kwargs)[source]¶
Parameters:
- tensor (torch.Tensor) – Batch of tensors.
- *args – Arguments passed to Encoder.batch_decode.
- dim (int, optional) – Dimension along which to split tensors.
- **kwargs – Keyword arguments passed to Encoder.batch_decode.
Returns: Batch of decoded labels.
Return type: list
batch_encode(iterator, *args, dim=0, **kwargs)[source]¶
Parameters:
- iterator (iterator) – Batch of labels to encode.
- *args – Arguments passed to Encoder.batch_encode.
- dim (int, optional) – Dimension along which to concatenate tensors.
- **kwargs – Keyword arguments passed to Encoder.batch_encode.
Returns: Tensor of encoded labels.
Return type: torch.Tensor
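A sketch of the batch round trip, assuming the default reserved_labels=['<unk>'] so the sample labels occupy indices 1 and 2:
>>> encoder = LabelEncoder(['label_a', 'label_b'])
>>> batch = encoder.batch_encode(['label_b', 'label_a'])
>>> batch
tensor([2, 1])
>>> encoder.batch_decode(batch)
['label_b', 'label_a']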
decode(encoded)[source]¶
Decodes encoded label.
Parameters: encoded (torch.Tensor) – Encoded label.
Returns: Label decoded from encoded.
Return type: object
class torchnlp.encoders.text.CharacterEncoder(*args, **kwargs)[source]¶
Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder
Encodes text into a tensor by splitting the text into individual characters.
Parameters:
- *args – Arguments passed onto StaticTokenizerEncoder.__init__.
- **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
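A minimal sketch, assuming the five default reserved tokens fill indices 0–4 so the sample's characters start at index 5:
>>> encoder = CharacterEncoder(['abc'])
>>> encoder.encode('cab')
tensor([7, 5, 6])
>>> encoder.decode(encoder.encode('cab'))
'cab'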
class torchnlp.encoders.text.DelimiterEncoder(delimiter, *args, **kwargs)[source]¶
Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder
Encodes text into a tensor by splitting the text using a delimiter.
Parameters:
- delimiter (string) – Delimiter used with string.split.
- *args – Arguments passed onto StaticTokenizerEncoder.__init__.
- **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
Example
>>> encoder = DelimiterEncoder('|', ['token_a|token_b', 'token_c'])
>>> encoder.encode('token_a|token_c')
tensor([5, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'token_a', 'token_b', 'token_c']
class torchnlp.encoders.text.MosesEncoder(*args, **kwargs)[source]¶
Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder
Encodes the text using the Moses tokenizer.
Tokenizer Reference: http://www.nltk.org/_modules/nltk/tokenize/moses.html
Parameters:
- *args – Arguments passed onto StaticTokenizerEncoder.__init__.
- **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
NOTE: The doctest is skipped because running NLTK Moses with Python 3.7's pytest halts on Travis.
Example
>>> encoder = MosesEncoder(["This ain't funny.", "Don't?"])  # doctest: +SKIP
>>> encoder.encode("This ain't funny.")  # doctest: +SKIP
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab  # doctest: +SKIP
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ain', "'t", 'funny', '.', 'Don', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))  # doctest: +SKIP
"This ain't funny."
torchnlp.encoders.text.pad_tensor(tensor, length, padding_index=0)[source]¶
Pad a tensor to length with padding_index.
Parameters:
- tensor (torch.Tensor [n, …]) – Tensor to pad.
- length (int) – Pad the tensor up to length.
- padding_index (int, optional) – Index to pad the tensor with.
Returns: (torch.Tensor [length, …]) Padded Tensor.
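A short usage sketch (padding along the first dimension with the default padding_index=0):
>>> import torch
>>> from torchnlp.encoders.text import pad_tensor
>>> pad_tensor(torch.LongTensor([1, 2, 3]), length=5)
tensor([1, 2, 3, 0, 0])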
torchnlp.encoders.text.stack_and_pad_tensors(batch, padding_index=0, dim=0)[source]¶
Pad a list of tensors (batch) with padding_index.
Parameters:
- batch (list of torch.Tensor) – Batch of tensors to pad.
- padding_index (int, optional) – Index to pad the tensors with.
- dim (int, optional) – Dimension along which to stack the tensors.
Returns: SequenceBatch – Padded tensors and original lengths of tensors.
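A usage sketch, assuming SequenceBatch is a namedtuple exposing tensor and lengths fields:
>>> import torch
>>> from torchnlp.encoders.text import stack_and_pad_tensors
>>> batch = stack_and_pad_tensors([torch.LongTensor([1, 2, 3]), torch.LongTensor([4, 5])])
>>> batch.tensor
tensor([[1, 2, 3],
        [4, 5, 0]])
>>> batch.lengths
tensor([3, 2])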
class torchnlp.encoders.text.TextEncoder(enforce_reversible=False)[source]¶
Bases: torchnlp.encoders.encoder.Encoder
batch_decode(tensor, lengths, dim=0, *args, **kwargs)[source]¶
Parameters:
- tensor (torch.Tensor) – Batch of encoded sequences.
- lengths (torch.Tensor) – Original lengths of sequences.
- dim (int, optional) – Dimension along which to split tensors.
- *args – Arguments passed to decode.
- **kwargs – Keyword arguments passed to decode.
Returns: Batch of decoded sequences.
Return type: list
batch_encode(iterator, *args, dim=0, **kwargs)[source]¶
Parameters:
- iterator (iterator) – Batch of text to encode.
- *args – Arguments passed onto Encoder.batch_encode.
- dim (int, optional) – Dimension along which to concatenate tensors.
- **kwargs – Keyword arguments passed onto Encoder.batch_encode.
Returns: torch.Tensor, torch.Tensor – Encoded and padded batch of sequences; original lengths of sequences.
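A round-trip sketch using a concrete subclass (WhitespaceEncoder, documented below); the indices assume the five default reserved tokens come first:
>>> encoder = WhitespaceEncoder(['hello world', 'hi'])
>>> tensor, lengths = encoder.batch_encode(['hello world', 'hi'])
>>> tensor
tensor([[5, 6],
        [7, 0]])
>>> lengths
tensor([2, 1])
>>> encoder.batch_decode(tensor, lengths)
['hello world', 'hi']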
class torchnlp.encoders.text.SpacyEncoder(*args, **kwargs)[source]¶
Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder
Encodes the text using spaCy's tokenizer.
Tokenizer Reference: https://spacy.io/api/tokenizer
Parameters:
- *args – Arguments passed onto StaticTokenizerEncoder.__init__.
- language (string, optional) – Language to use for parsing. Accepted values are 'en', 'de', 'es', 'pt', 'fr', 'it', 'nl' and 'xx'. For details see https://spacy.io/models/#available-models
- **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
Example
>>> encoder = SpacyEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ai', "n't", 'funny', '.', 'Do', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ai n't funny ."
batch_encode(sequences)[source]¶
Parameters: sequences (iterator) – Batch of text to encode.
Returns: torch.Tensor, torch.Tensor – Encoded and padded batch of sequences; original lengths of sequences.
class torchnlp.encoders.text.StaticTokenizerEncoder(sample, min_occurrences=1, append_sos=False, append_eos=False, tokenize=<function _tokenize>, detokenize=<function _detokenize>, reserved_tokens=['<pad>', '<unk>', '</s>', '<s>', '<copy>'], sos_index=3, eos_index=2, unknown_index=1, padding_index=0, **kwargs)[source]¶
Bases: torchnlp.encoders.text.text_encoder.TextEncoder
Encodes a text sequence using a static tokenizer.
Parameters:
- sample (collections.abc.Iterable) – Sample of data used to build the encoding dictionary.
- min_occurrences (int, optional) – Minimum number of occurrences for a token to be added to the encoding dictionary.
- tokenize (callable) – Callable to tokenize a sequence.
- detokenize (callable) – Callable to detokenize a sequence.
- append_sos (bool, optional) – If True, insert the SOS token at the start of the encoded vector.
- append_eos (bool, optional) – If True, append the EOS token onto the end of the encoded vector.
- reserved_tokens (list of str, optional) – List of reserved tokens inserted at the beginning of the dictionary.
- sos_index (int, optional) – The sos token is used to encode the start of a sequence. This is the index that token resides at.
- eos_index (int, optional) – The eos token is used to encode the end of a sequence. This is the index that token resides at.
- unknown_index (int, optional) – The unknown token is used to encode unseen tokens. This is the index that token resides at.
- padding_index (int, optional) – The padding token is used to encode sequence padding. This is the index that token resides at.
- **kwargs – Keyword arguments passed onto TextEncoder.__init__.
Example
>>> sample = ["This ain't funny.", "Don't?"]
>>> encoder = StaticTokenizerEncoder(sample, tokenize=lambda s: s.split())
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.', "Don't?"]
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
decode(encoded)[source]¶
Decodes a tensor into a sequence.
Parameters: encoded (torch.Tensor) – Encoded sequence.
Returns: Sequence decoded from encoded.
Return type: str
class torchnlp.encoders.text.SubwordEncoder(sample, append_sos=False, append_eos=False, target_vocab_size=None, min_occurrences=1, max_occurrences=1000.0, reserved_tokens=['<pad>', '<unk>', '</s>', '<s>', '<copy>'], sos_index=3, eos_index=2, unknown_index=1, padding_index=0, **kwargs)[source]¶
Bases: torchnlp.encoders.text.text_encoder.TextEncoder
Invertibly encodes text using a limited vocabulary.
Applies Google's Tensor2Tensor SubwordTextTokenizer, which invertibly encodes a native string as a sequence of subtokens from a limited vocabulary. To build the vocabulary, it uses recursive binary search to find a minimum token count x (such that min_occurrences <= x <= max_occurrences) that most closely matches target_vocab_size.
Tokenizer Reference: https://github.com/tensorflow/tensor2tensor/blob/8bdecbe434d93cb1e79c0489df20fee2d5a37dc2/tensor2tensor/data_generators/text_encoder.py#L389
Parameters:
- sample (list) – Sample of data used to build the encoding dictionary.
- append_sos (bool, optional) – If True, insert the SOS token at the start of the encoded vector.
- append_eos (bool, optional) – If True, append the EOS token onto the end of the encoded vector.
- target_vocab_size (int, optional) – Desired size of the vocabulary.
- min_occurrences (int, optional) – Lower bound for the minimum token count.
- max_occurrences (int, optional) – Upper bound for the minimum token count.
- reserved_tokens (list of str, optional) – List of reserved tokens inserted at the beginning of the dictionary.
- sos_index (int, optional) – The sos token is used to encode the start of a sequence. This is the index that token resides at.
- eos_index (int, optional) – The eos token is used to encode the end of a sequence. This is the index that token resides at.
- unknown_index (int, optional) – The unknown token is used to encode unseen tokens. This is the index that token resides at.
- padding_index (int, optional) – The padding token is used to encode sequence padding. This is the index that token resides at.
- **kwargs – Keyword arguments passed onto TextEncoder.__init__.
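A hedged sketch of the invertible round trip; exact token indices are omitted because they depend on the subword vocabulary learned from the sample:
>>> encoder = SubwordEncoder(['This is a sentence.', 'This is another sentence.'], target_vocab_size=50)
>>> encoder.decode(encoder.encode('This is a new sentence.'))
'This is a new sentence.'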
decode(encoded)[source]¶
Decodes a tensor into a sequence.
Parameters: encoded (torch.Tensor) – Encoded sequence.
Returns: Sequence decoded from encoded.
Return type: str
class torchnlp.encoders.text.TreebankEncoder(*args, **kwargs)[source]¶
Bases: torchnlp.encoders.text.static_tokenizer_encoder.StaticTokenizerEncoder
Encodes text using the Treebank tokenizer.
Tokenizer Reference: http://www.nltk.org/_modules/nltk/tokenize/treebank.html
Parameters:
- *args – Arguments passed onto StaticTokenizerEncoder.__init__.
- **kwargs – Keyword arguments passed onto StaticTokenizerEncoder.__init__.
Example
>>> encoder = TreebankEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7, 8, 9])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', 'ai', "n't", 'funny', '.', 'Do', '?']
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."
class torchnlp.encoders.text.WhitespaceEncoder(*args, **kwargs)[source]¶
Bases: torchnlp.encoders.text.delimiter_encoder.DelimiterEncoder
Encodes text by splitting on whitespace.
Parameters:
- *args – Arguments passed onto DelimiterEncoder.__init__.
- **kwargs – Keyword arguments passed onto DelimiterEncoder.__init__.
Example
>>> encoder = WhitespaceEncoder(["This ain't funny.", "Don't?"])
>>> encoder.encode("This ain't funny.")
tensor([5, 6, 7])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.', "Don't?"]
>>> encoder.decode(encoder.encode("This ain't funny."))
"This ain't funny."