torchnlp.samplers package

The torchnlp.samplers package introduces a set of samplers. Samplers sample elements from a dataset. torchnlp.samplers plug into torch.utils.data.distributed.DistributedSampler and torch.utils.data.DataLoader.
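
For example, a minimal sketch of plugging a sampler into torch.utils.data.DataLoader (the toy dataset, sort_key, and batch_size are illustrative choices, not taken from the upstream documentation):

>>> from torch.utils.data import DataLoader
>>> from torchnlp.samplers import SortedSampler
>>>
>>> data = ['a', 'bb', 'ccc', 'dddd']
>>> sampler = SortedSampler(data, sort_key=len)
>>> loader = DataLoader(data, sampler=sampler, batch_size=2)
>>> list(loader)
[['a', 'bb'], ['ccc', 'dddd']]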

class torchnlp.samplers.BalancedSampler(data_source, get_class=<function identity>, get_weight=<function BalancedSampler.<lambda>>, **kwargs)[source]

Weighted sampler that weights each element with respect to its class.

Parameters:
  • data_source (iterable) – Data to sample from.
  • get_class (callable, optional) – Get the class of an item relative to the entire dataset.
  • get_weight (callable, optional) – Define a weight for each item other than one.
  • kwargs – Additional key word arguments passed onto WeightedRandomSampler.

Example

>>> from torchnlp.samplers import BalancedSampler
>>> from torchnlp.samplers import DeterministicSampler
>>>
>>> data = ['a', 'b', 'c'] + ['c'] * 100
>>> sampler = BalancedSampler(data, num_samples=3)
>>> sampler = DeterministicSampler(sampler, random_seed=12)
>>> [data[i] for i in sampler]
['c', 'b', 'a']
class torchnlp.samplers.BPTTBatchSampler(data, bptt_length, batch_size, drop_last, type_='source')[source]

Samples sequentially a batch of source and target slices of size bptt_length.

Typically, such a sampler is used for language modeling training with backpropagation through time (BPTT).

Reference: https://github.com/pytorch/examples/blob/c66593f1699ece14a4a2f4d314f1afb03c6793d9/word_language_model/main.py#L61

Parameters:
  • data (iterable) –
  • bptt_length (int) – Length of the slice.
  • batch_size (int) – Size of mini-batch.
  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
  • type_ (str, optional) – Type of batch [‘source’|’target’] to load, where a target batch is one timestep ahead of the source batch.

Example

>>> from torchnlp.samplers import BPTTBatchSampler
>>> sampler = BPTTBatchSampler(range(100), bptt_length=2, batch_size=3, drop_last=False)
>>> list(sampler)[0] # First Batch
[slice(0, 2, None), slice(34, 36, None), slice(67, 69, None)]
class torchnlp.samplers.BPTTSampler(data, bptt_length, type_='source')[source]

Samples sequentially source and target slices of size bptt_length.

Typically, such a sampler is used for language modeling training with backpropagation through time (BPTT).

Reference: https://github.com/pytorch/examples/blob/c66593f1699ece14a4a2f4d314f1afb03c6793d9/word_language_model/main.py#L122

Parameters:
  • data (iterable) – Iterable data.
  • bptt_length (int) – Length of the slice.
  • type_ (str, optional) – Type of slice [‘source’|’target’] to load, where a target slice is one timestep ahead of the source slice.

Example

>>> from torchnlp.samplers import BPTTSampler
>>> list(BPTTSampler(range(5), 2))
[slice(0, 2, None), slice(2, 4, None)]
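
For the target variant, each slice is one timestep ahead of the corresponding source slice; an illustrative sketch of the expected output (assuming that shift is the only difference):

>>> list(BPTTSampler(range(5), 2, type_='target'))
[slice(1, 3, None), slice(3, 5, None)]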
class torchnlp.samplers.BucketBatchSampler(sampler, batch_size, drop_last, sort_key=<function identity>, bucket_size_multiplier=100)[source]

BucketBatchSampler toggles between sampler batches and sorted batches.

Typically, the sampler will be a RandomSampler, allowing the user to toggle between random batches and sorted batches. A larger bucket_size_multiplier yields more thoroughly sorted batches, and vice versa.

Background:

BucketBatchSampler is similar to a BucketIterator found in popular libraries like AllenNLP and torchtext. A BucketIterator pools together examples of similar length to reduce the padding required for each batch while maintaining some noise through bucketing.

AllenNLP Implementation: https://github.com/allenai/allennlp/blob/master/allennlp/data/iterators/bucket_iterator.py

torchtext Implementation: https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L225

Parameters:
  • sampler (torch.utils.data.sampler.Sampler) –
  • batch_size (int) – Size of mini-batch.
  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
  • sort_key (callable, optional) – Callable to specify a comparison key for sorting.
  • bucket_size_multiplier (int, optional) – Buckets are of size batch_size * bucket_size_multiplier.

Example

>>> from torchnlp.random import set_seed
>>> set_seed(123)
>>>
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import BucketBatchSampler
>>> sampler = SequentialSampler(list(range(10)))
>>> list(BucketBatchSampler(sampler, batch_size=3, drop_last=False))
[[6, 7, 8], [0, 1, 2], [3, 4, 5], [9]]
>>> list(BucketBatchSampler(sampler, batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
class torchnlp.samplers.DeterministicSampler(sampler, random_seed, cuda=False)[source]

Maintains a random state such that sampler returns the same output in every process.

Parameters:
  • sampler (torch.utils.data.sampler.Sampler) –
  • random_seed (int) –
  • cuda (bool, optional) – If True, this sampler forks the random state of CUDA as well.
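
Example

An illustrative sketch (the wrapped RandomSampler and seed are arbitrary choices; it only checks that the wrapped sampler's indices are preserved):

>>> from torch.utils.data.sampler import RandomSampler
>>> from torchnlp.samplers import DeterministicSampler
>>>
>>> sampler = RandomSampler(range(10))
>>> sampler = DeterministicSampler(sampler, random_seed=123)
>>> sorted(sampler) == list(range(10))
True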
class torchnlp.samplers.DistributedBatchSampler(batch_sampler, **kwargs)[source]

BatchSampler wrapper that distributes each batch across multiple workers.

Parameters:
  • batch_sampler (torch.utils.data.sampler.BatchSampler) –
  • num_replicas (int, optional) – Number of processes participating in distributed training.
  • rank (int, optional) – Rank of the current process within num_replicas.

Example

>>> from torch.utils.data.sampler import BatchSampler
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import DistributedBatchSampler
>>> sampler = SequentialSampler(list(range(12)))
>>> batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
>>>
>>> list(DistributedBatchSampler(batch_sampler, num_replicas=2, rank=0))
[[0, 2], [4, 6], [8, 10]]
>>> list(DistributedBatchSampler(batch_sampler, num_replicas=2, rank=1))
[[1, 3], [5, 7], [9, 11]]
class torchnlp.samplers.DistributedSampler(iterable, num_replicas=None, rank=None)[source]

Iterable wrapper that distributes data across multiple workers.

Parameters:
  • iterable (iterable) –
  • num_replicas (int, optional) – Number of processes participating in distributed training.
  • rank (int, optional) – Rank of the current process within num_replicas.

Example

>>> from torchnlp.samplers import DistributedSampler
>>> list(DistributedSampler(range(10), num_replicas=2, rank=0))
[0, 2, 4, 6, 8]
>>> list(DistributedSampler(range(10), num_replicas=2, rank=1))
[1, 3, 5, 7, 9]
torchnlp.samplers.get_number_of_elements(object_)[source]

Get the sum of the number of elements in all tensors stored in object_.

This is particularly useful for sampling the largest objects based on tensor size, for example as the get_item_size argument of OomBatchSampler.__init__.

Parameters: object_ (any) –
Returns: The number of elements in object_.
Return type: (int)
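
Example

An illustrative sketch (the dictionary keys and tensor shapes are arbitrary; the result assumes the element counts of all tensors found in the object are summed, i.e. 4 * 5 + 10):

>>> import torch
>>> from torchnlp.samplers import get_number_of_elements
>>>
>>> get_number_of_elements({'spectrogram': torch.zeros(4, 5), 'tokens': torch.zeros(10)})
30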
class torchnlp.samplers.NoisySortedSampler(data, sort_key=<function identity>, get_noise=<function _uniform_noise>)[source]

Samples elements sequentially with noise.

Background

NoisySortedSampler is similar to a BucketIterator found in popular libraries like AllenNLP and torchtext. A BucketIterator pools together examples of similar length to reduce the padding required for each batch. BucketIterator also includes the ability to add noise to the pooling.

AllenNLP Implementation: https://github.com/allenai/allennlp/blob/e125a490b71b21e914af01e70e9b00b165d64dcd/allennlp/data/iterators/bucket_iterator.py

torchtext Implementation: https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L225

Parameters:
  • data (iterable) – Data to sample from.
  • sort_key (callable) – Specifies a function of one argument that is used to extract a numerical comparison key from each list element.
  • get_noise (callable) – Noise added to each numerical sort_key.

Example

>>> from torchnlp.random import set_seed
>>> set_seed(123)
>>>
>>> import random
>>> from torchnlp.samplers import NoisySortedSampler
>>> get_noise = lambda i: round(random.uniform(-1, 1))
>>> list(NoisySortedSampler(range(10), sort_key=lambda i: i, get_noise=get_noise))
[0, 1, 2, 3, 5, 4, 6, 7, 9, 8]
class torchnlp.samplers.OomBatchSampler(batch_sampler, get_item_size, num_batches=5)[source]

Out-of-memory (OOM) batch sampler that wraps batch_sampler to sample the num_batches largest batches first, in an attempt to cause any potential OOM errors early.

Credits: https://github.com/allenai/allennlp/blob/3d100d31cc8d87efcf95c0b8d162bfce55c64926/allennlp/data/iterators/bucket_iterator.py#L43

Parameters:
  • batch_sampler (torch.utils.data.sampler.BatchSampler) –
  • get_item_size (callable) – Measure the size of an item given its index (an int).
  • num_batches (int, optional) – The number of large batches to move to the beginning of the iteration.
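
Example

An illustrative sketch (the wrapped BatchSampler and get_item_size are arbitrary choices; it only checks that the number of batches is unchanged, since only their order is affected):

>>> from torch.utils.data.sampler import BatchSampler
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import OomBatchSampler
>>>
>>> sampler = SequentialSampler(list(range(12)))
>>> batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
>>> oom_sampler = OomBatchSampler(batch_sampler, get_item_size=lambda i: i, num_batches=1)
>>> len(list(oom_sampler))
3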
class torchnlp.samplers.RepeatSampler(sampler)[source]

Sampler that repeats forever.

Background:
The repeat sampler can be used with the DataLoader with the option to reuse worker processes. Learn more here: https://github.com/pytorch/pytorch/issues/15849
Parameters: sampler (torch.utils.data.sampler.Sampler) –
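
Example

An illustrative sketch (itertools.islice is used only to truncate the otherwise infinite iterator):

>>> from itertools import islice
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import RepeatSampler
>>>
>>> sampler = RepeatSampler(SequentialSampler(range(3)))
>>> list(islice(sampler, 7))
[0, 1, 2, 0, 1, 2, 0]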
class torchnlp.samplers.SortedSampler(data, sort_key=<function identity>)[source]

Samples elements sequentially, always in the same order.

Parameters:
  • data (iterable) – Iterable data.
  • sort_key (callable) – Specifies a function of one argument that is used to extract a numerical comparison key from each list element.

Example

>>> from torchnlp.samplers import SortedSampler
>>> list(SortedSampler(range(10), sort_key=lambda i: -i))
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]