torchnlp.samplers package

The torchnlp.samplers package introduces a set of samplers. Samplers sample elements from a dataset. torchnlp.samplers plug into torch.utils.data.distributed.DistributedSampler and torch.utils.data.DataLoader.
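
For example, a minimal sketch of plugging a sampler into torch.utils.data.DataLoader (the toy dataset, sort_key, and batch_size are illustrative choices, not taken from the upstream documentation):

>>> from torch.utils.data import DataLoader
>>> from torchnlp.samplers import SortedSampler
>>>
>>> data = ['a', 'bb', 'ccc', 'dddd']
>>> sampler = SortedSampler(data, sort_key=len)
>>> loader = DataLoader(data, sampler=sampler, batch_size=2)
>>> list(loader)
[['a', 'bb'], ['ccc', 'dddd']]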

class torchnlp.samplers.BalancedSampler(data_source, get_class=<function identity>, get_weight=<function BalancedSampler.<lambda>>, **kwargs)[source]

Weighted sampler that weights each element with respect to its class.

Parameters:
  • data_source (iterable) – Data to sample from.
  • get_class (callable, optional) – Get the class of an item relative to the entire dataset.
  • get_weight (callable, optional) – Define a weight for each item other than one.
  • kwargs – Additional key word arguments passed onto WeightedRandomSampler.

Example

>>> from torchnlp.samplers import BalancedSampler
>>> from torchnlp.samplers import DeterministicSampler
>>>
>>> data = ['a', 'b', 'c'] + ['c'] * 100
>>> sampler = BalancedSampler(data, num_samples=3)
>>> sampler = DeterministicSampler(sampler, random_seed=12)
>>> [data[i] for i in sampler]
['c', 'b', 'a']
class torchnlp.samplers.BPTTBatchSampler(data, bptt_length, batch_size, drop_last, type_='source')[source]

Samples sequentially a batch of source and target slices of size bptt_length.

Typically, such a sampler is used for language modeling training with backpropagation through time (BPTT).

Reference: https://github.com/pytorch/examples/blob/c66593f1699ece14a4a2f4d314f1afb03c6793d9/word_language_model/main.py#L61

Parameters:
  • data (iterable) –
  • bptt_length (int) – Length of the slice.
  • batch_size (int) – Size of mini-batch.
  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
  • type_ (str, optional) – Type of batch [‘source’|’target’] to load, where a target batch is one timestep ahead of the source batch.

Example

>>> from torchnlp.samplers import BPTTBatchSampler
>>> sampler = BPTTBatchSampler(range(100), bptt_length=2, batch_size=3, drop_last=False)
>>> list(sampler)[0] # First Batch
[slice(0, 2, None), slice(34, 36, None), slice(67, 69, None)]
class torchnlp.samplers.BPTTSampler(data, bptt_length, type_='source')[source]

Samples sequentially source and target slices of size bptt_length.

Typically, such a sampler is used for language modeling training with backpropagation through time (BPTT).

Reference: https://github.com/pytorch/examples/blob/c66593f1699ece14a4a2f4d314f1afb03c6793d9/word_language_model/main.py#L122

Parameters:
  • data (iterable) – Iterable data.
  • bptt_length (int) – Length of the slice.
  • type_ (str, optional) – Type of slice [‘source’|’target’] to load, where a target slice is one timestep ahead of the source slice.

Example

>>> from torchnlp.samplers import BPTTSampler
>>> list(BPTTSampler(range(5), 2))
[slice(0, 2, None), slice(2, 4, None)]
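
For the target variant, each slice is one timestep ahead of the corresponding source slice; an illustrative sketch of the expected output (assuming that shift is the only difference):

>>> list(BPTTSampler(range(5), 2, type_='target'))
[slice(1, 3, None), slice(3, 5, None)]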
class torchnlp.samplers.BucketBatchSampler(sampler, batch_size, drop_last, sort_key=<function identity>, bucket_size_multiplier=100)[source]

BucketBatchSampler toggles between sampler batches and sorted batches.

Typically, the sampler will be a RandomSampler, allowing the user to toggle between random batches and sorted batches. A larger bucket_size_multiplier yields more thoroughly sorted batches, and vice versa.

Background:

BucketBatchSampler is similar to a BucketIterator found in popular libraries like AllenNLP and torchtext. A BucketIterator pools together examples of similar length to reduce the padding required for each batch while maintaining some noise through bucketing.

AllenNLP Implementation: https://github.com/allenai/allennlp/blob/master/allennlp/data/iterators/bucket_iterator.py

torchtext Implementation: https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L225

Parameters:
  • sampler (torch.utils.data.sampler.Sampler) –
  • batch_size (int) – Size of mini-batch.
  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
  • sort_key (callable, optional) – Callable to specify a comparison key for sorting.
  • bucket_size_multiplier (int, optional) – Buckets are of size batch_size * bucket_size_multiplier.

Example

>>> from torchnlp.random import set_seed
>>> set_seed(123)
>>>
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import BucketBatchSampler
>>> sampler = SequentialSampler(list(range(10)))
>>> list(BucketBatchSampler(sampler, batch_size=3, drop_last=False))
[[6, 7, 8], [0, 1, 2], [3, 4, 5], [9]]
>>> list(BucketBatchSampler(sampler, batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
class torchnlp.samplers.DeterministicSampler(sampler, random_seed, cuda=False)[source]

Maintains a random state such that sampler returns the same output in every process.

Parameters:
  • sampler (torch.utils.data.sampler.Sampler) –
  • random_seed (int) –
  • cuda (bool, optional) – If True, this sampler forks the random state of CUDA as well.
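
Example

An illustrative sketch (the wrapped RandomSampler and seed are arbitrary choices; it only checks that the wrapped sampler's indices are preserved):

>>> from torch.utils.data.sampler import RandomSampler
>>> from torchnlp.samplers import DeterministicSampler
>>>
>>> sampler = RandomSampler(range(10))
>>> sampler = DeterministicSampler(sampler, random_seed=123)
>>> sorted(sampler) == list(range(10))
True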
class torchnlp.samplers.DistributedBatchSampler(batch_sampler, **kwargs)[source]

BatchSampler wrapper that distributes each batch across multiple workers.

Parameters:
  • batch_sampler (torch.utils.data.sampler.BatchSampler) –
  • num_replicas (int, optional) – Number of processes participating in distributed training.
  • rank (int, optional) – Rank of the current process within num_replicas.

Example

>>> from torch.utils.data.sampler import BatchSampler
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import DistributedBatchSampler
>>> sampler = SequentialSampler(list(range(12)))
>>> batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
>>>
>>> list(DistributedBatchSampler(batch_sampler, num_replicas=2, rank=0))
[[0, 2], [4, 6], [8, 10]]
>>> list(DistributedBatchSampler(batch_sampler, num_replicas=2, rank=1))
[[1, 3], [5, 7], [9, 11]]
class torchnlp.samplers.DistributedSampler(iterable, num_replicas=None, rank=None)[source]

Iterable wrapper that distributes data across multiple workers.

Parameters:
  • iterable (iterable) –
  • num_replicas (int, optional) – Number of processes participating in distributed training.
  • rank (int, optional) – Rank of the current process within num_replicas.

Example

>>> from torchnlp.samplers import DistributedSampler
>>> list(DistributedSampler(range(10), num_replicas=2, rank=0))
[0, 2, 4, 6, 8]
>>> list(DistributedSampler(range(10), num_replicas=2, rank=1))
[1, 3, 5, 7, 9]
torchnlp.samplers.get_number_of_elements(object_)[source]

Get the sum of the number of elements in all tensors stored in object_.

This is particularly useful for sampling the largest objects based on tensor size, for example as the get_item_size argument of OomBatchSampler.__init__.

Parameters: object_ (any) –
Returns: The number of elements in object_.
Return type: (int)
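
Example

An illustrative sketch (the dictionary keys and tensor shapes are arbitrary; the result assumes the element counts of all tensors found in the object are summed, i.e. 4 * 5 + 10):

>>> import torch
>>> from torchnlp.samplers import get_number_of_elements
>>>
>>> get_number_of_elements({'spectrogram': torch.zeros(4, 5), 'tokens': torch.zeros(10)})
30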
class torchnlp.samplers.NoisySortedSampler(data, sort_key=<function identity>, get_noise=<function _uniform_noise>)[source]

Samples elements sequentially with noise.

Background

NoisySortedSampler is similar to a BucketIterator found in popular libraries like AllenNLP and torchtext. A BucketIterator pools together examples of similar length to reduce the padding required for each batch. BucketIterator also includes the ability to add noise to the pooling.

AllenNLP Implementation: https://github.com/allenai/allennlp/blob/e125a490b71b21e914af01e70e9b00b165d64dcd/allennlp/data/iterators/bucket_iterator.py

torchtext Implementation: https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L225

Parameters:
  • data (iterable) – Data to sample from.
  • sort_key (callable) – Specifies a function of one argument that is used to extract a numerical comparison key from each list element.
  • get_noise (callable) – Noise added to each numerical sort_key.

Example

>>> from torchnlp.random import set_seed
>>> set_seed(123)
>>>
>>> import random
>>> from torchnlp.samplers import NoisySortedSampler
>>> get_noise = lambda i: round(random.uniform(-1, 1))
>>> list(NoisySortedSampler(range(10), sort_key=lambda i: i, get_noise=get_noise))
[0, 1, 2, 3, 5, 4, 6, 7, 9, 8]
class torchnlp.samplers.OomBatchSampler(batch_sampler, get_item_size, num_batches=5)[source]

Out-of-memory (OOM) batch sampler that wraps batch_sampler to sample the num_batches largest batches first, in an attempt to cause any potential OOM errors early.

Credits: https://github.com/allenai/allennlp/blob/3d100d31cc8d87efcf95c0b8d162bfce55c64926/allennlp/data/iterators/bucket_iterator.py#L43

Parameters:
  • batch_sampler (torch.utils.data.sampler.BatchSampler) –
  • get_item_size (callable) – Measure the size of an item given its index (an int).
  • num_batches (int, optional) – The number of large batches to move to the beginning of the iteration.
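
Example

An illustrative sketch (the wrapped BatchSampler and get_item_size are arbitrary choices; it only checks that the number of batches is unchanged, since only their order is affected):

>>> from torch.utils.data.sampler import BatchSampler
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import OomBatchSampler
>>>
>>> sampler = SequentialSampler(list(range(12)))
>>> batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
>>> oom_sampler = OomBatchSampler(batch_sampler, get_item_size=lambda i: i, num_batches=1)
>>> len(list(oom_sampler))
3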
class torchnlp.samplers.RepeatSampler(sampler)[source]

Sampler that repeats forever.

Background:
The repeat sampler can be used with the DataLoader with the option to reuse worker processes. Learn more here: https://github.com/pytorch/pytorch/issues/15849
Parameters: sampler (torch.utils.data.sampler.Sampler) –
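
Example

An illustrative sketch (itertools.islice is used only to truncate the otherwise infinite iterator):

>>> from itertools import islice
>>> from torch.utils.data.sampler import SequentialSampler
>>> from torchnlp.samplers import RepeatSampler
>>>
>>> sampler = RepeatSampler(SequentialSampler(range(3)))
>>> list(islice(sampler, 7))
[0, 1, 2, 0, 1, 2, 0]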
class torchnlp.samplers.SortedSampler(data, sort_key=<function identity>)[source]

Samples elements sequentially, always in the same order.

Parameters:
  • data (iterable) – Iterable data.
  • sort_key (callable) – Specifies a function of one argument that is used to extract a numerical comparison key from each list element.

Example

>>> from torchnlp.samplers import SortedSampler
>>> list(SortedSampler(range(10), sort_key=lambda i: -i))
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]