Reputation: 1646
I used random_split() to divide my data into train and test sets, and I noticed that if the split is done after the DataLoader is created, the batch dimension is missing when I fetch a batch of data from the resulting loaders.
import torch
from torchvision import transforms, datasets
from torch.utils.data import random_split

# Normalize the data
transform_image = transforms.Compose([
    transforms.Resize((240, 320)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

data = '/data/imgs/train'

def load_dataset():
    data_path = data
    main_dataset = datasets.ImageFolder(
        root=data_path,
        transform=transform_image
    )
    loader = torch.utils.data.DataLoader(
        dataset=main_dataset,
        batch_size=64,
        num_workers=0,
        shuffle=True
    )
    # Dataset has 22424 data points
    trainloader, testloader = random_split(loader.dataset, [21000, 1424])
    return trainloader, testloader

trainloader, testloader = load_dataset()
Now to get a single batch of images from the train and test loaders:
images, labels = next(iter(trainloader))
images.shape
# %%
len(trainloader)
# %%
images_test, labels_test = next(iter(testloader))
images_test.shape
# %%
len(testloader)
The output I get does not have the batch dimension for either the train or the test batches. The output dims should be [batch x channel x H x W], but I get [channel x H x W].
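For illustration (assuming RGB inputs and the transforms above, so each image becomes a 3 x 240 x 320 tensor), the four commands produce output along these lines:

images.shape        # torch.Size([3, 240, 320]) -- no batch dimension
len(trainloader)    # 21000 -- number of samples, not batches
images_test.shape   # torch.Size([3, 240, 320])
len(testloader)     # 1424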
But if I split the dataset first and then create two data loaders from the splits, I do get the batch size in the output.
def load_dataset():
    data_path = data
    main_dataset = datasets.ImageFolder(
        root=data_path,
        transform=transform_image
    )
    # Dataset has 22424 data points
    train_data, test_data = random_split(main_dataset, [21000, 1424])
    trainloader = torch.utils.data.DataLoader(
        dataset=train_data,
        batch_size=64,
        num_workers=0,
        shuffle=True
    )
    testloader = torch.utils.data.DataLoader(
        dataset=test_data,
        batch_size=64,
        num_workers=0,
        shuffle=True
    )
    return trainloader, testloader

trainloader, testloader = load_dataset()
On running the same four commands to get a single train and test batch, the shapes now include the batch dimension.
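A sketch of the expected output (assuming full batches of 64 RGB images; the final, partial batch would be smaller):

images.shape        # torch.Size([64, 3, 240, 320])
len(trainloader)    # 329 batches: ceil(21000 / 64)
images_test.shape   # torch.Size([64, 3, 240, 320])
len(testloader)     # 23 batches: ceil(1424 / 64)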
Is the first approach wrong? The lengths show that the data has been split, so why do I not see the batch size?
Upvotes: 1
Views: 4202
Reputation: 22174
The first approach is wrong. Only DataLoader instances return batches of items; Dataset-like instances don't.
When you call random_split you pass it loader.dataset, which is just a reference to main_dataset (not a DataLoader). The result is that trainloader and testloader are Datasets, not DataLoaders. In fact, you discard loader, which is your only DataLoader, when you return from load_dataset.
The second version is what you should do to get two separate DataLoaders.
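A quick way to verify this (a minimal check, not part of the original code): random_split returns torch.utils.data.Subset objects, which are Datasets, not DataLoaders.

from torch.utils.data import DataLoader, Subset

trainloader, testloader = load_dataset()    # the first version above
print(isinstance(trainloader, Subset))      # True: it is a Dataset wrapper
print(isinstance(trainloader, DataLoader))  # False: nothing batches the samples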
Upvotes: 2
Reputation: 1518
You are splitting a dataset in two. This gives you two Datasets, which, when iterated over, return single image tensors of shape (channel, height, width), i.e., (3, h, w); it does not by default give you a DataLoader around those datasets.
What you did next is in fact the right next step: create a DataLoader around each dataset. You define the batch size in the DataLoader, and iterating over a DataLoader then returns tensors of shape (batch_size, channel, height, width).
Even if you intend to feed the model batches of size one, you still need a batch dimension in the tensor. For this, you can either use a DataLoader with batch_size=1, or add a dummy dimension at the front with torch.unsqueeze(X, 0) for an image X (or, equivalently, X.unsqueeze(0)), making the tensor of shape (1, 3, h, w).
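A minimal sketch of that last point (X here is an illustrative stand-in for a single image tensor):

import torch

X = torch.randn(3, 240, 320)  # one image: (channel, H, W)
batched = X.unsqueeze(0)      # add a batch dimension at the front
print(batched.shape)          # torch.Size([1, 3, 240, 320])

# Equivalent functional form:
print(torch.unsqueeze(X, 0).shape)  # torch.Size([1, 3, 240, 320])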
Upvotes: 1