Reputation: 563
I want to train a neural network using gradient descent on batches that each contain N training points. I would like each batch to contain only points with the same label, rather than points randomly sampled from the training set.
For example, if I'm training using MNIST, I would like to have batches that look like the following:
batch_1 = {0,0,0,0,0,0,0,0}
batch_2 = {3,3,3,3,3,3,3,3}
batch_3 = {7,7,7,7,7,7,7,7}
.....
and so on.
How can I do this in PyTorch?
Upvotes: 3
Views: 1523
Reputation: 9806
One way to do it is to create a subset and a dataloader for each class, and then randomly switch between the dataloaders at each iteration:
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import MNIST
from torchvision import transforms
import numpy as np

dataset = MNIST('path/to/mnist_root/',
                transform=transforms.ToTensor(),
                download=True)

# Indices of the samples that belong to each class
class_inds = [torch.where(dataset.targets == class_idx)[0]
              for class_idx in dataset.class_to_idx.values()]

# One dataloader per class, each sampling only from that class's subset
dataloaders = [
    DataLoader(
        dataset=Subset(dataset, inds),
        batch_size=8,
        shuffle=True,
        drop_last=False)
    for inds in class_inds]

epochs = 1

for epoch in range(epochs):
    iterators = list(map(iter, dataloaders))
    while iterators:
        # Pick one of the remaining per-class iterators at random
        iterator = np.random.choice(iterators)
        try:
            images, labels = next(iterator)
            print(labels)
            # do_more_stuff()
        except StopIteration:
            # This class has run out of batches for the epoch
            iterators.remove(iterator)
This will work with any dataset that exposes its labels the way torchvision datasets do (via the targets tensor and the class_to_idx mapping), not just MNIST; a sketch for generic datasets follows the output below. Here's the result of printing the labels at each iteration:
tensor([6, 6, 6, 6, 6, 6, 6, 6])
tensor([3, 3, 3, 3, 3, 3, 3, 3])
tensor([0, 0, 0, 0, 0, 0, 0, 0])
tensor([5, 5, 5, 5, 5, 5, 5, 5])
tensor([8, 8, 8, 8, 8, 8, 8, 8])
tensor([0, 0, 0, 0, 0, 0, 0, 0])
...
tensor([1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1])
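If your dataset doesn't provide a targets tensor (that attribute is a torchvision convention, not part of the generic Dataset API), one possible fallback is to collect the indices per class by iterating over the samples once. A minimal sketch, assuming a map-style dataset that yields (input, label) pairs; note that this loads every sample once up front:
import torch
from collections import defaultdict

# Group the sample indices by label
inds_by_class = defaultdict(list)
for idx in range(len(dataset)):
    _, label = dataset[idx]
    inds_by_class[int(label)].append(idx)

# Same structure as class_inds above, usable with the same dataloaders
class_inds = [torch.tensor(inds) for inds in inds_by_class.values()]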
Note that with drop_last=False, there will be batches, here and there, with fewer than batch_size elements (like the last one above). Setting it to True makes all batches equal in size, but drops some data points.
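For completeness, here is the drop_last=True variant, reusing dataset and class_inds from the code above; each class then loses at most batch_size - 1 samples per epoch:
dataloaders = [
    DataLoader(
        dataset=Subset(dataset, inds),
        batch_size=8,
        shuffle=True,
        drop_last=True)  # discard each class's leftover partial batch
    for inds in class_inds]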
Upvotes: 6