torchnlp.samplers package

The torchnlp.samplers package introduces a set of samplers. Samplers sample elements from a dataset. torchnlp.samplers plug into torch.utils.data.distributed.DistributedSampler and torch.utils.data.DataLoader.
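For example, a sampler can be passed to torch.utils.data.DataLoader via its sampler argument. A minimal sketch (the toy dataset and the collate_fn below are illustrative, not part of the library):

>>> from torch.utils.data import DataLoader
>>> from torchnlp.samplers import SortedSampler
>>>
>>> data = [[1, 2, 3], [4, 5], [6]]  # Toy dataset of variable-length sequences.
>>> sampler = SortedSampler(data, sort_key=len)  # Yields indices from shortest to longest.
>>> loader = DataLoader(data, sampler=sampler, batch_size=2, collate_fn=list)
>>> batches = list(loader)  # Batches are drawn in order of increasing sequence length.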
class torchnlp.samplers.BalancedSampler(data_source, get_class=<function identity>, get_weight=<function BalancedSampler.<lambda>>, **kwargs)[source]

Weighted sampler that weights each element according to its class.

Parameters:
- data_source (iterable) – Data to sample from.
- get_class (callable, optional) – Get the class of an item relative to the entire dataset.
- get_weight (callable, optional) – Define a weight for each item other than one.
- kwargs – Additional keyword arguments passed onto WeightedRandomSampler.
Example
>>> from torchnlp.samplers import DeterministicSampler
>>>
>>> data = ['a', 'b', 'c'] + ['c'] * 100
>>> sampler = BalancedSampler(data, num_samples=3)
>>> sampler = DeterministicSampler(sampler, random_seed=12)
>>> [data[i] for i in sampler]
['c', 'b', 'a']
class torchnlp.samplers.BPTTBatchSampler(data, bptt_length, batch_size, drop_last, type_='source')[source]

Sequentially samples a batch of source and target slices of size bptt_length.

Typically, such a sampler is used for language modeling training with backpropagation through time (BPTT).

Parameters:
- data (iterable) – Data to sample from.
- bptt_length (int) – Length of the slice.
- batch_size (int) – Size of mini-batch.
- drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
- type_ (str, optional) – Type of batch ['source'|'target'] to load, where a target batch is one timestep ahead of its source batch.
Example
>>> sampler = BPTTBatchSampler(range(100), bptt_length=2, batch_size=3, drop_last=False)
>>> list(sampler)[0]  # First Batch
[slice(0, 2, None), slice(34, 36, None), slice(67, 69, None)]
class torchnlp.samplers.BPTTSampler(data, bptt_length, type_='source')[source]

Sequentially samples source and target slices of size bptt_length.

Typically, such a sampler is used for language modeling training with backpropagation through time (BPTT).

Parameters:
- data (iterable) – Data to sample from.
- bptt_length (int) – Length of the slice.
- type_ (str, optional) – Type of slice ['source'|'target'] to load, where a target slice is one timestep ahead of its source slice.

Example
>>> from torchnlp.samplers import BPTTSampler
>>> list(BPTTSampler(range(5), 2))
[slice(0, 2, None), slice(2, 4, None)]
class torchnlp.samplers.BucketBatchSampler(sampler, batch_size, drop_last, sort_key=<function identity>, bucket_size_multiplier=100)[source]

BucketBatchSampler toggles between sampler batches and sorted batches.

Typically, the sampler will be a RandomSampler, allowing the user to toggle between random batches and sorted batches. A larger bucket_size_multiplier yields more sorted batches, and vice versa.

Background:
BucketBatchSampler is similar to a BucketIterator found in popular libraries like AllenNLP and torchtext. A BucketIterator pools together examples with a similar size length to reduce the padding required for each batch while maintaining some noise through bucketing.

AllenNLP Implementation: https://github.com/allenai/allennlp/blob/master/allennlp/data/iterators/bucket_iterator.py
torchtext Implementation: https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L225

Parameters:
- sampler (torch.utils.data.sampler.Sampler) – Sampler over the data to batch.
- batch_size (int) – Size of mini-batch.
- drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
- sort_key (callable, optional) – Callable to specify a comparison key for sorting.
- bucket_size_multiplier (int, optional) – Buckets are of size batch_size * bucket_size_multiplier.
Example
>>> from torchnlp.random import set_seed
>>> set_seed(123)
>>>
>>> from torch.utils.data.sampler import SequentialSampler
>>> sampler = SequentialSampler(list(range(10)))
>>> list(BucketBatchSampler(sampler, batch_size=3, drop_last=False))
[[6, 7, 8], [0, 1, 2], [3, 4, 5], [9]]
>>> list(BucketBatchSampler(sampler, batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
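A batch sampler like this one is typically plugged into torch.utils.data.DataLoader through its batch_sampler argument. A minimal sketch, assuming sort_key is called with dataset indices produced by the wrapped sampler; the toy dataset and collate_fn are illustrative, not part of the library:

>>> from torch.utils.data import DataLoader
>>> from torch.utils.data.sampler import RandomSampler
>>>
>>> data = ['ab', 'abcd', 'a', 'abc']  # Toy dataset of variable-length strings.
>>> sampler = RandomSampler(data)
>>> batch_sampler = BucketBatchSampler(sampler, batch_size=2, drop_last=False, sort_key=lambda i: len(data[i]))
>>> loader = DataLoader(data, batch_sampler=batch_sampler, collate_fn=list)
>>> batches = list(loader)  # Each batch groups examples of similar length.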
class torchnlp.samplers.DeterministicSampler(sampler, random_seed, cuda=False)[source]

Maintains a random state such that the wrapped sampler returns the same output in every process.

Parameters:
- sampler (torch.utils.data.sampler.Sampler) – Sampler to wrap.
- random_seed (int) – Seed used to fix the random state.
- cuda (bool, optional) –
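Example

A minimal usage sketch; the wrapped RandomSampler and toy data are illustrative, not part of this class's documentation:

>>> from torch.utils.data.sampler import RandomSampler
>>>
>>> data = ['a', 'b', 'c', 'd', 'e']
>>> sampler = RandomSampler(data)
>>> sampler = DeterministicSampler(sampler, random_seed=123)
>>> indices = list(sampler)  # The same ordering is produced on every process sharing this seed.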
class torchnlp.samplers.DistributedBatchSampler(batch_sampler, **kwargs)[source]

BatchSampler wrapper that distributes each batch across multiple workers.

Parameters:
- batch_sampler (torch.utils.data.sampler.BatchSampler) – Batch sampler to wrap.
- kwargs – Additional keyword arguments passed onto DistributedSampler.

Example
>>> from torch.utils.data.sampler import BatchSampler
>>> from torch.utils.data.sampler import SequentialSampler
>>> sampler = SequentialSampler(list(range(12)))
>>> batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
>>>
>>> list(DistributedBatchSampler(batch_sampler, num_replicas=2, rank=0))
[[0, 2], [4, 6], [8, 10]]
>>> list(DistributedBatchSampler(batch_sampler, num_replicas=2, rank=1))
[[1, 3], [5, 7], [9, 11]]
class torchnlp.samplers.DistributedSampler(iterable, num_replicas=None, rank=None)[source]

Iterable wrapper that distributes data across multiple workers.

Parameters:
- iterable (iterable) – Iterable data to distribute.
- num_replicas (int, optional) – Number of processes participating in distributed training.
- rank (int, optional) – Rank of the current process within num_replicas.

Example
>>> list(DistributedSampler(range(10), num_replicas=2, rank=0))
[0, 2, 4, 6, 8]
>>> list(DistributedSampler(range(10), num_replicas=2, rank=1))
[1, 3, 5, 7, 9]
torchnlp.samplers.get_number_of_elements(object_)[source]

Get the sum of the number of elements in all tensors stored in object_.

This is particularly useful for sampling the largest objects based on tensor size, as in OomBatchSampler.__init__'s get_item_size.

Parameters: object_ (any) –
Returns: The number of elements in object_.
Return type: (int)
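Example

A minimal sketch, assuming the tensors nested in a dictionary are discovered and their element counts summed, per the description above:

>>> import torch
>>> from torchnlp.samplers import get_number_of_elements
>>>
>>> batch = {'tokens': torch.zeros(3, 4), 'label': torch.zeros(1)}
>>> get_number_of_elements(batch)  # 3 * 4 + 1 = 13 elements across both tensors.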
class torchnlp.samplers.NoisySortedSampler(data, sort_key=<function identity>, get_noise=<function _uniform_noise>)[source]

Samples elements sequentially with noise.

Background:
NoisySortedSampler is similar to a BucketIterator found in popular libraries like AllenNLP and torchtext. A BucketIterator pools together examples with a similar size length to reduce the padding required for each batch. BucketIterator also includes the ability to add noise to the pooling.

AllenNLP Implementation: https://github.com/allenai/allennlp/blob/e125a490b71b21e914af01e70e9b00b165d64dcd/allennlp/data/iterators/bucket_iterator.py
torchtext Implementation: https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L225

Parameters:
- data (iterable) – Data to sample from.
- sort_key (callable) – Specifies a function of one argument that is used to extract a numerical comparison key from each list element.
- get_noise (callable) – Noise added to each numerical sort_key.
Example
>>> from torchnlp.random import set_seed
>>> set_seed(123)
>>>
>>> import random
>>> get_noise = lambda i: round(random.uniform(-1, 1))
>>> list(NoisySortedSampler(range(10), sort_key=lambda i: i, get_noise=get_noise))
[0, 1, 2, 3, 5, 4, 6, 7, 9, 8]
class torchnlp.samplers.OomBatchSampler(batch_sampler, get_item_size, num_batches=5)[source]

Out-of-memory (OOM) batch sampler that wraps batch_sampler to sample the num_batches largest batches first, in an attempt to trigger any potential OOM errors early.

Parameters:
- batch_sampler (torch.utils.data.sampler.BatchSampler) – Batch sampler to wrap.
- get_item_size (callable) – Measure the size of an item given its index (int).
- num_batches (int, optional) – The number of large batches to move to the beginning of the iteration.
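Example

A minimal sketch; the toy length list and the lambda passed as get_item_size are illustrative, not part of the library:

>>> from torch.utils.data.sampler import BatchSampler, SequentialSampler
>>>
>>> lengths = [5, 1, 8, 2, 7, 3, 9, 4, 6, 0]  # Toy per-item sizes.
>>> batch_sampler = BatchSampler(SequentialSampler(lengths), batch_size=2, drop_last=False)
>>> batch_sampler = OomBatchSampler(batch_sampler, get_item_size=lambda i: lengths[i], num_batches=2)
>>> batches = list(batch_sampler)  # The two largest batches are moved to the front of the iteration.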
class torchnlp.samplers.RepeatSampler(sampler)[source]

Sampler that repeats forever.

Background:
The repeat sampler can be used with the DataLoader to re-use worker processes. Learn more here: https://github.com/pytorch/pytorch/issues/15849

Parameters:
- sampler (torch.utils.data.sampler.Sampler) – Sampler to repeat.
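Example

A minimal sketch, assuming iteration simply restarts the wrapped sampler once it is exhausted; itertools.islice is used to take a finite slice of the endless stream:

>>> from itertools import islice
>>> from torch.utils.data.sampler import SequentialSampler
>>>
>>> sampler = RepeatSampler(SequentialSampler(range(3)))
>>> list(islice(sampler, 7))  # Take 7 indices from the endless stream.
[0, 1, 2, 0, 1, 2, 0]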
class torchnlp.samplers.SortedSampler(data, sort_key=<function identity>)[source]

Samples elements sequentially, always in the same order.

Parameters:
- data (iterable) – Iterable data.
- sort_key (callable) – Specifies a function of one argument that is used to extract a numerical comparison key from each list element.
Example
>>> list(SortedSampler(range(10), sort_key=lambda i: -i))
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]