Anuvrat Tiku

Reputation: 1646

PyTorch: batch size is missing in data after torch.utils.data.random_split() is used on dataloader.dataset

I used random_split() to divide my data into train and test sets, and I observed that if the split is done after the DataLoader is created, the batch size is missing when getting a batch of data from the loader.

import torch
from torchvision import transforms, datasets
from torch.utils.data import random_split

# Normalize the data
transform_image = transforms.Compose([
  transforms.Resize((240, 320)),
  transforms.ToTensor(),
  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

data = '/data/imgs/train'

def load_dataset():
  data_path = data
  main_dataset = datasets.ImageFolder(
    root=data_path,
    transform=transform_image
  )

  loader = torch.utils.data.DataLoader(
    dataset=main_dataset,
    batch_size=64,
    num_workers=0,
    shuffle=True
  )

  # Dataset has 22424 data points
  trainloader, testloader = random_split(loader.dataset, [21000, 1424])

  return trainloader, testloader

trainloader, testloader = load_dataset()

Now to get a single batch of images from the train and test loaders:

images, labels = next(iter(trainloader))
images.shape
# %%
len(trainloader)

# %%
images_test, labels_test = next(iter(testloader))
images_test.shape

# %%
len(testloader)

The output that I get does not have the batch size for the train or test batches. The output dims should be [batch x channel x H x W], but I get [channel x H x W].

Output:

(Screenshot of the output: images.shape is torch.Size([3, 240, 320]), with no batch dimension; len(trainloader) is 21000.)

But if I create the split from the dataset first and then make two data loaders using the splits, I get the batch size in the output.

def load_dataset():
    data_path = data
    main_dataset = datasets.ImageFolder(
      root=data_path,
      transform=transform_image
    )
    # Dataset has 22424 data points
    train_data, test_data = random_split(main_dataset, [21000, 1424])

    trainloader = torch.utils.data.DataLoader(
      dataset=train_data,
      batch_size=64,
      num_workers=0,
      shuffle=True
    )

    testloader = torch.utils.data.DataLoader(
      dataset=test_data,
      batch_size=64,
      num_workers=0,
      shuffle=True
    )

    return trainloader, testloader

trainloader, testloader = load_dataset()

On running the same 4 commands to get a single train and test batch:

(Screenshot of the output: images.shape is torch.Size([64, 3, 240, 320]); the batch dimension is present.)

Is the first approach wrong? The lengths show that the data has been split, so why do I not see the batch size?

Upvotes: 1

Views: 4202

Answers (2)

jodag

Reputation: 22174

The first approach is wrong.

Only DataLoader instances return batches of items; Dataset-like instances don't.

When you call random_split you pass it loader.dataset, which is just a reference to main_dataset (not a DataLoader). The result is that trainloader and testloader are Datasets, not DataLoaders. In fact, you discard loader, which is your only DataLoader, when you return from load_dataset.

The second version is what you should do to get two separate DataLoaders.
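For illustration, here is a minimal sketch of the types involved, using a small random TensorDataset as a stand-in for the ImageFolder in the question (the sizes are made up):

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in dataset: 100 fake "images" of shape [3, 240, 320] with binary labels
fake_dataset = TensorDataset(torch.randn(100, 3, 240, 320),
                             torch.randint(0, 2, (100,)))

train_data, test_data = random_split(fake_dataset, [80, 20])
print(type(train_data))  # <class 'torch.utils.data.dataset.Subset'>, a Dataset, not a DataLoader

x, y = train_data[0]
print(x.shape)           # torch.Size([3, 240, 320]), a single sample with no batch dimension

# Wrapping the split in a DataLoader is what adds the batch dimension
trainloader = DataLoader(train_data, batch_size=64, shuffle=True)
xb, yb = next(iter(trainloader))
print(xb.shape)          # torch.Size([64, 3, 240, 320])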

Upvotes: 2

dumbPy

Reputation: 1518

You are splitting a dataset into two. This gives you two Datasets, which, when iterated over, return single image tensors of shape (channel, height, width), i.e. (3, h, w); splitting does not, by default, give you a DataLoader around these datasets.
What you did next is the right step: create a DataLoader around each dataset. You define the batch size in the DataLoader, and iterating over the DataLoader then returns tensors of shape (batch_size, channel, height, width).

Even if you intend to feed the model batches of size one, you still need a batch dimension in the tensor. For this, you can either use a DataLoader with batch_size=1, or add a dummy dimension at the start with torch.unsqueeze(X, 0) (equivalently, X.unsqueeze(0)) for an image X, making the tensor of shape (1, 3, h, w).
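A quick sketch of the unsqueeze option, with a random tensor standing in for an image X:

import torch

X = torch.randn(3, 240, 320)  # a single image, [channel, height, width]

# Both calls add a leading batch dimension of size 1
batched = torch.unsqueeze(X, 0)
batched = X.unsqueeze(0)
print(batched.shape)          # torch.Size([1, 3, 240, 320])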

Upvotes: 1
