lhk

Reputation: 30196

Is it possible to split the training DataLoader (and dataset) into training and validation datasets?

The torchvision package provides easy access to commonly used datasets. You would use them like this:

import torch
import torchvision
import torchvision.transforms as transforms

# Any transform works here; ToTensor is just a minimal choice
transform = transforms.ToTensor()

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

Apparently, you can only switch between train=True and train=False. The docs explain:

train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.

But this goes against the common practice of having a three-way split. For serious work, I need another DataLoader with a validation set. It would also be nice to specify the split proportions myself. The docs don't say what percentage of the dataset is reserved for testing, and I might want to change that.

I assume that this is a conscious design decision: everyone working on one of these datasets is supposed to use the same test set, which makes results comparable. But I still need to get a validation set out of the training data. Is it possible to split a DataLoader into two separate streams of data?

Upvotes: 6

Views: 6393

Answers (1)

lhk

Reputation: 30196

In the meantime, I came across the function random_split. You don't split the DataLoader; you split the Dataset and then build a separate DataLoader for each part:

torch.utils.data.random_split(dataset, lengths)
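Here is a minimal sketch of how that could look. To keep it runnable without a download, it uses a small synthetic TensorDataset as a stand-in for the CIFAR10 training set. The 80/20 proportion, the seed and the variable names are my own assumptions, not anything prescribed by torchvision.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in for the training set (assumption: 100 samples of 3x32x32).
# With real data you would pass the torchvision dataset object here instead.
full_train = TensorDataset(torch.randn(100, 3, 32, 32),
                           torch.randint(0, 10, (100,)))

# Hypothetical split proportion: hold out 20% for validation.
val_fraction = 0.2
n_val = int(len(full_train) * val_fraction)
n_train = len(full_train) - n_val

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = random_split(full_train, [n_train, n_val],
                                        generator=generator)

# Each subset gets its own DataLoader.
train_loader = DataLoader(train_subset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=4, shuffle=False)
```

Note that both subsets wrap the same underlying dataset, so any transform set on it applies to the training and the validation part alike.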

Upvotes: 9
