qalis
qalis

Reputation: 1533

Validation dataset in PyTorch using DataLoaders

I want to load the MNIST dataset in PyTorch and Torchvision, dividing it into train, validation, and test parts. So far I have:

def load_dataset():
    train_loader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST(
            '/data/', train=True, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor()])),
        batch_size=batch_size_train, shuffle=True)

    test_loader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST(
            '/data/', train=False, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor()])),
        batch_size=batch_size_test, shuffle=True)

How can I divide the training dataset into training and validation if it's in the DataLoader? I want to use the last 10000 examples from the training dataset as a validation dataset (I know that I should do a CV for more accurate results, I just want a quick validation here).

Upvotes: 12

Views: 15546

Answers (2)

Eric
Eric

Reputation: 476

If you'd like to ensure your splits have balanced classes, you can use train_test_split from sklearn.

import torchvision
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import train_test_split

VAL_SIZE = 0.1
BATCH_SIZE = 64

mnist_train = torchvision.datasets.MNIST(
    '/data/',
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
)
mnist_test = torchvision.datasets.MNIST(
    '/data/',
    train=False,
    download=True,
    transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
)

# generate indices: instead of the actual data we pass in integers instead
train_indices, val_indices, _, _ = train_test_split(
    range(len(mnist_train)),
    mnist_train.targets,
    stratify=mnist_train.targets,
    test_size=VAL_SIZE,
)

# generate a subset based on indices
train_split = Subset(mnist_train, train_indices)
val_split = Subset(mnist_train, val_indices)

# create batches
train_batches = DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True)
val_batches = DataLoader(val_split, batch_size=BATCH_SIZE, shuffle=True)
test_batches = DataLoader(mnist_test, batch_size=BATCH_SIZE, shuffle=True)

Upvotes: 3

qalis
qalis

Reputation: 1533

Splitting the training dataset into training and validation in PyTorch turns out to be much harder than it should be.

First, split the training set into training and validation subsets (class Subset), which are not datasets (class Dataset):

train_subset, val_subset = torch.utils.data.random_split(
        train, [50000, 10000], generator=torch.Generator().manual_seed(1))

Then get actual data from those datasets:

X_train = train_subset.dataset.data[train_subset.indices]
y_train = train_subset.dataset.targets[train_subset.indices]

X_val = val_subset.dataset.data[val_subset.indices]
y_val = val_subset.dataset.targets[val_subset.indices]

Note that this way we don't have Dataset objects, so we can't use DataLoader objects for batch training. If you want to use DataLoaders, they work directly with Subsets:

train_loader = DataLoader(dataset=train_subset, shuffle=True, batch_size=BATCH_SIZE)
val_loader = DataLoader(dataset=val_subset, shuffle=False, batch_size=BATCH_SIZE)

Upvotes: 15

Related Questions