Reputation: 563
I want to train a neural network using gradient descent on batches that each contain N training points. I would like each batch to contain only points with the same label, rather than points randomly sampled from the training set.
For example, if I'm training using MNIST, I would like to have batches that look like the following:
batch_1 = {0,0,0,0,0,0,0,0}
batch_2 = {3,3,3,3,3,3,3,3}
batch_3 = {7,7,7,7,7,7,7,7}
.....
and so on.
How can I do this in PyTorch?
Upvotes: 3
Views: 1523
Reputation: 9806
One way to do it is to create a subset and a dataloader for each class, and then randomly switch between the dataloaders at each iteration:
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import MNIST
from torchvision import transforms
import numpy as np

dataset = MNIST('path/to/mnist_root/',
                transform=transforms.ToTensor(),
                download=True)

# Indices of the samples that belong to each class
class_inds = [torch.where(dataset.targets == class_idx)[0]
              for class_idx in dataset.class_to_idx.values()]

# One dataloader per class, each sampling only from that class's subset
dataloaders = [
    DataLoader(
        dataset=Subset(dataset, inds),
        batch_size=8,
        shuffle=True,
        drop_last=False)
    for inds in class_inds]

epochs = 1

for epoch in range(epochs):
    iterators = list(map(iter, dataloaders))
    while iterators:
        # Pick one of the remaining per-class iterators at random
        iterator = np.random.choice(iterators)
        try:
            images, labels = next(iterator)
            print(labels)
            # do_more_stuff()
        except StopIteration:
            # This class has run out of batches for the epoch
            iterators.remove(iterator)
This will work with any dataset that exposes its labels the way torchvision datasets do (via the targets tensor and the class_to_idx mapping), not just MNIST; a sketch for generic datasets follows the output below. Here's the result of printing the labels at each iteration:
tensor([6, 6, 6, 6, 6, 6, 6, 6])
tensor([3, 3, 3, 3, 3, 3, 3, 3])
tensor([0, 0, 0, 0, 0, 0, 0, 0])
tensor([5, 5, 5, 5, 5, 5, 5, 5])
tensor([8, 8, 8, 8, 8, 8, 8, 8])
tensor([0, 0, 0, 0, 0, 0, 0, 0])
...
tensor([1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1])
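If your dataset doesn't provide a targets tensor (that attribute is a torchvision convention, not part of the generic Dataset API), one possible fallback is to collect the indices per class by iterating over the samples once. A minimal sketch, assuming a map-style dataset that yields (input, label) pairs; note that this loads every sample once up front:
import torch
from collections import defaultdict

# Group the sample indices by label
inds_by_class = defaultdict(list)
for idx in range(len(dataset)):
    _, label = dataset[idx]
    inds_by_class[int(label)].append(idx)

# Same structure as class_inds above, usable with the same dataloaders
class_inds = [torch.tensor(inds) for inds in inds_by_class.values()]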
Note that with drop_last=False, there will be batches, here and there, with fewer than batch_size elements (like the last one above). Setting it to True makes all batches equal in size, but drops some data points.
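For completeness, here is the drop_last=True variant, reusing dataset and class_inds from the code above; each class then loses at most batch_size - 1 samples per epoch:
dataloaders = [
    DataLoader(
        dataset=Subset(dataset, inds),
        batch_size=8,
        shuffle=True,
        drop_last=True)  # discard each class's leftover partial batch
    for inds in class_inds]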
Upvotes: 6