scikit-learn train_test_split into PyTorch DataLoader

I have a dataset for binary classification with PNGs titled as in the attachment below, where the first 0 or 1 in the title determines its class. They're in a folder called "annotation_class", and I have a small script to separate these:

import glob
import os

import cv2
import numpy as np
from sklearn.model_selection import train_test_split

filelist = glob.glob(os.path.join('annotation_class', '*.png'))
size_row, size_col = 256, 256
X, y = [], []

for name in filelist:
    img = cv2.imread(name)  # OpenCV loads images as BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (size_col, size_row))  # dsize is (width, height)
    X.append(img)
    # the class label is the second '_'-separated field of the file name
    y.append(int(os.path.basename(name).split('_')[1]))

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=4)

All four return values are lists. I'm using PyTorch for this project and would like to write a custom Dataset to use with DataLoader, but I'm not sure how best to do that after calling train_test_split. Should I scrap that altogether and use something else? I'd like to end up with two DataLoaders, one for training and one for testing.

sample titles

Upvotes: 1

Views: 5892

Answers (2)

Pann Vandet

Reputation: 11

I am also new to ML, but I have built DataLoaders from the output of train_test_split() before. Here are my steps:

  1. Convert the outputs of train_test_split to tensors.
  2. Wrap the tensors in a TensorDataset and pass it to a DataLoader.

In your case, that looks like this:

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=4)

# First step: converting to tensors
# (train_test_split returned lists here, and torch.from_numpy needs
# NumPy arrays, so convert with np.array first)
x_train_to_tensor = torch.from_numpy(np.array(x_train)).to(torch.float32)
y_train_to_tensor = torch.from_numpy(np.array(y_train)).to(torch.long)
x_test_to_tensor = torch.from_numpy(np.array(x_test)).to(torch.float32)
y_test_to_tensor = torch.from_numpy(np.array(y_test)).to(torch.long)

# Second step: creating a TensorDataset for each DataLoader
train_dataset = TensorDataset(x_train_to_tensor, y_train_to_tensor)
test_dataset = TensorDataset(x_test_to_tensor, y_test_to_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=16)
test_dataloader = DataLoader(test_dataset, batch_size=16)

Note: the labels are converted to torch.long rather than float32 because F.nll_loss expects its targets to be integer class indices.
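To see the two steps run end to end without the image folder, here is a minimal sketch on synthetic data. The random arrays stand in for the lists returned by train_test_split, and the single Linear layer is a hypothetical placeholder model, not something from the question. The permute to channels-first and the division by 255 are assumptions that only apply if the model expects (N, C, H, W) inputs in [0, 1], as Conv2d-based models typically do.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-ins for the lists returned by train_test_split:
# 8 RGB images (uint8, height x width x channels) and 8 binary labels.
rng = np.random.default_rng(0)
x_train = [rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8) for _ in range(8)]
y_train = [0, 1, 0, 1, 1, 0, 1, 0]

# Stack the list into one array, then convert to float32 / long tensors.
x_tensor = torch.from_numpy(np.stack(x_train)).to(torch.float32)
y_tensor = torch.from_numpy(np.array(y_train)).to(torch.long)

# Assumption: the model wants channels-first input scaled to [0, 1].
x_tensor = x_tensor.permute(0, 3, 1, 2) / 255.0

loader = DataLoader(TensorDataset(x_tensor, y_tensor), batch_size=4, shuffle=True)
images, labels = next(iter(loader))

# Hypothetical placeholder model: one linear layer over flattened pixels.
model = torch.nn.Linear(3 * 256 * 256, 2)
log_probs = F.log_softmax(model(images.flatten(start_dim=1)), dim=1)

# nll_loss takes log-probabilities and integer (long) class indices.
loss = F.nll_loss(log_probs, labels)
print(images.shape, labels.dtype, loss.item())
```

Each batch then has shape (4, 3, 256, 256) for the images and a long tensor of 4 labels, which is what F.nll_loss needs as its target.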

Upvotes: 0

ayandas

Reputation: 2288

You don't have to rewrite anything. You can reuse your core data-loading logic inside a PyTorch Dataset:

import cv2, glob
import numpy as np
from sklearn.model_selection import train_test_split

from torch.utils.data import Dataset, DataLoader

class MyCoolDataset(Dataset):

    def __init__(self, dir, train=True):
        filelist = glob.glob(dir + '/*.png')
        ...
        # all your data loading logic using cv2, glob ..
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=4)

        # two modes - train and test
        if train:
            self.x_data, self.y_data = x_train, y_train
        else:
            self.x_data, self.y_data = x_test, y_test

    def __len__(self):
        # DataLoader's default sampler needs the number of samples
        return len(self.x_data)

    def __getitem__(self, i):
        return self.x_data[i], self.y_data[i]

Then use a DataLoader as usual, creating one instance with train=True and one with train=False:

dl = DataLoader(MyCoolDataset(...), batch_size=...)
for X, Y in dl:
    pass
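For a version that can be run without any image files, here is a sketch of the same two-mode idea on synthetic in-memory data. SplitDataset and its X, y arguments are hypothetical names used for illustration; the real class would load the PNGs in __init__ as described above. Each fake image is filled with its own index so that the samples reaching each loader can be identified.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader


class SplitDataset(Dataset):
    """Hypothetical in-memory variant of the two-mode dataset."""

    def __init__(self, X, y, train=True):
        # The same random_state in both modes means both instances
        # compute the same split, so train and test never overlap.
        x_train, x_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=4)
        if train:
            self.x_data, self.y_data = x_train, y_train
        else:
            self.x_data, self.y_data = x_test, y_test

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, i):
        # Convert per sample; the default collate function stacks the batch.
        x = torch.from_numpy(self.x_data[i]).to(torch.float32)
        return x, int(self.y_data[i])


# Synthetic data: 10 tiny "images", each filled with its own index.
X = [np.full((4, 4, 3), i, dtype=np.uint8) for i in range(10)]
y = [i % 2 for i in range(10)]

train_loader = DataLoader(SplitDataset(X, y, train=True), batch_size=4, shuffle=True)
test_loader = DataLoader(SplitDataset(X, y, train=False), batch_size=4)

# Recover which original samples ended up in each loader.
train_ids = {int(img[0, 0, 0]) for batch, _ in train_loader for img in batch}
test_ids = {int(img[0, 0, 0]) for batch, _ in test_loader for img in batch}
print(len(train_ids), len(test_ids), train_ids.isdisjoint(test_ids))
```

Because both instances split with the same random_state, the training and test sets are complementary. The trade-off is that each instance processes the full dataset before discarding part of it, which is probably fine for a small folder of images but wasteful for a large one.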

Upvotes: 3
