rwallace

Reputation: 33505

What is PyTorch Dataset supposed to return?

I'm trying to get PyTorch to work with DataLoader, said to be the easiest way to handle mini-batches, which are in some cases necessary for best performance.

DataLoader wants a Dataset as input.

Most of the documentation on Dataset assumes you are working with an off-the-shelf standard dataset, e.g. MNIST, or at least with images, and can use the existing machinery as a black box. I'm working with non-image data I'm generating myself. My best current attempt to distill the documentation on how to do that down to a minimal test case is:

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class Dataset1(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 80

    def __getitem__(self, i):
        # actual data is blank, just to test the mechanics of Dataset
        return [0.0, 0.0, 0.0], 1.0


train_dataloader = DataLoader(Dataset1(), batch_size=8)

for X, y in train_dataloader:
    print(f"X: {X}")
    print(f"y: {y.shape} {y.dtype} {y}")
    break


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(3, 10),
            nn.ReLU(),
            nn.Linear(10, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)


device = torch.device("cpu")
model = Net().to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for X, y in train_dataloader:
        X, y = X.to(device), y.to(device)

        pred = model(X)
        loss = criterion(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The output of the above program is:

X: [tensor([0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64), tensor([0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64), tensor([0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64)]
y: torch.Size([8]) torch.float64 tensor([1., 1., 1., 1., 1., 1., 1., 1.], dtype=torch.float64)
Traceback (most recent call last):
  File "C:\ml\test_dataloader.py", line 47, in <module>
    X, y = X.to(device), y.to(device)
AttributeError: 'list' object has no attribute 'to'

In all the example code I can find, X, y = X.to(device), y.to(device) succeeds because X is indeed a tensor (whereas in my version it is not). So I'm trying to find out what exactly converts X to a tensor, because either the example code, e.g. https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html, does not do so, or I am failing to understand how and where it does.

Does Dataset itself convert things to tensors? The answer seems to be 'sort of'.

It has converted y to a tensor, a column of the y value for every example in the batch. That much makes sense, though it has used type float64, whereas in machine learning we usually prefer float32. I am used to the idea that Python always represents scalars in double precision, so the conversion from double to single precision happens at the time of forming a tensor, and that this can be ensured by specifying the dtype parameter. But in this case Dataset seems to have formed the tensor implicitly. Is there a place or way to specify the dtype parameter?
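Concretely, by specifying the dtype parameter I mean something like:

# the double-to-single conversion happens explicitly at tensor-creation time
x = torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32)

but with the tensor formed implicitly somewhere inside the loading machinery, I don't see where this would go.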

X is not a tensor, but a list thereof. It would make intuitive sense if it were a list of the examples in the batch, but instead of a list of 8 elements each containing 3 elements, it's the other way around. So Dataset has transposed the input data, which would make sense if it were forming a tensor to match the shape of y, but instead of making a single 2d tensor, it has made a list of 1d tensors. (And, again, in double precision.) Why? Is there a way to change this behavior?

The answer posted so far to Does pytorch Dataset.__getitem__ have to return a dict? says __getitem__ can return anything. Okay, but then how does the anything get converted to the form the training procedure requires?

Upvotes: 5

Views: 3099

Answers (2)

Ivan

Reputation: 40748

The dataset instance is only tasked with returning a single element of the dataset, which can take many forms: a dict, a list, an int, a float, a tensor, etc.

But the behaviour you are seeing is actually handled by your PyTorch data loader, not by the underlying dataset. This mechanism is called collating, and it is implemented by a collate function. You can provide your own via the collate_fn argument of DataLoader. The default collate function is provided by PyTorch as default_collate and handles the vast majority of cases. Have a look at its documentation, as it gives insight into the use cases it can handle.
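For example, here is a minimal sketch of a custom collate function for the Dataset1 in your question; the function name is illustrative, not part of the PyTorch API:

import torch
from torch.utils.data import DataLoader

def collate_to_float32(batch):
    # batch is a list of (features, label) pairs as returned by __getitem__
    xs, ys = zip(*batch)
    X = torch.tensor(xs, dtype=torch.float32)               # shape (batch, 3)
    y = torch.tensor(ys, dtype=torch.float32).unsqueeze(1)  # shape (batch, 1)
    return X, y

train_dataloader = DataLoader(Dataset1(), batch_size=8, collate_fn=collate_to_float32)

This both stacks each batch into single tensors and gives you one place to pin the dtype to float32, which also answers your question about where the dtype can be specified.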

With the default collate, the returned batch takes the same types as the item you returned from your dataset. You should therefore return tensors instead of lists, as @dx2-66 explained.
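You can see what the default collate does with tensor items by calling it directly (default_collate is exposed in torch.utils.data as of PyTorch 1.11):

import torch
from torch.utils.data import default_collate

items = [(torch.zeros(3), torch.tensor([1.0])) for _ in range(8)]
X, y = default_collate(items)
print(X.shape, X.dtype)  # torch.Size([8, 3]) torch.float32
print(y.shape, y.dtype)  # torch.Size([8, 1]) torch.float32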

Upvotes: 5

dx2-66

Reputation: 2851

The easiest way seems to be:

def __getitem__(self, i):
    # as_tensor yields float32 tensors here; wrapping the label in a
    # one-element tensor makes the collated batch shape (batch, 1)
    return torch.as_tensor([0.0, 0.0, 0.0]), torch.as_tensor([1.0])
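With this, the default collate stacks the items into a float32 X of shape (8, 3) and a float32 y of shape (8, 1), which matches the shape of the model's output for BCELoss.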

Upvotes: 1
