Reputation: 563
I have a dataset for binary classification with PNGs titled as in the attachment below, where the first 0 or 1 in the title determines its class. They're in a folder called "annotation_class", and I have a small script to separate these:
import cv2,glob
import numpy as np
from sklearn.model_selection import train_test_split
filelist = glob.glob('annotation_class'+'/*.png')
size_row, size_col = 256, 256
X,y = [],[]
for name in filelist:
    img = cv2.imread(name)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (size_row, size_col))
    X.append(img)
    y.append(int(name.split('\\')[-1].split('_')[1]))
x_train, x_test, y_train, y_test= train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=4)
The returns are all lists. I'm using PyTorch for this project and would like to make a custom Dataset so I can use a DataLoader, but I'm not sure how best to incorporate these lists after I've used train_test_split. Should I scrap that altogether and use something else? I'd like to end up with two DataLoaders, one for training and one for testing.
Upvotes: 1
Views: 5892
Reputation: 11
I am also new to ML, but I have built DataLoaders from the output of train_test_split() before. Here are my steps, applied to your case:
import torch
from torch.utils.data import TensorDataset, DataLoader

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=4)

# First step: converting to tensors
# (X and y are Python lists, so stack them into numpy arrays first)
x_train_to_tensor = torch.from_numpy(np.array(x_train)).to(torch.float32)
y_train_to_tensor = torch.from_numpy(np.array(y_train)).to(torch.long)
x_test_to_tensor = torch.from_numpy(np.array(x_test)).to(torch.float32)
y_test_to_tensor = torch.from_numpy(np.array(y_test)).to(torch.long)

# Second step: wrapping the tensors in a TensorDataset for the DataLoader
train_dataset = TensorDataset(x_train_to_tensor, y_train_to_tensor)
test_dataset = TensorDataset(x_test_to_tensor, y_test_to_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=16)
test_dataloader = DataLoader(test_dataset, batch_size=16)
Note: the labels are converted to torch.long rather than float32 because loss functions such as F.nll_loss (and nn.CrossEntropyLoss) expect integer class indices as targets.
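To make the dtype requirement concrete, here is a minimal sketch (the batch size and class count are made up for the example, not taken from your dataset) showing that F.nll_loss accepts integer class indices but rejects float labels:
import torch
import torch.nn.functional as F

# hypothetical batch: 4 samples, 2 classes (binary classification)
log_probs = F.log_softmax(torch.randn(4, 2), dim=1)      # model output as log-probabilities
labels = torch.tensor([0, 1, 1, 0], dtype=torch.long)    # class indices must be integers (long)

loss = F.nll_loss(log_probs, labels)        # works: targets are int64 class indices
# F.nll_loss(log_probs, labels.float())     # would raise an error: float targets are rejected
print(loss)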
Upvotes: 0
Reputation: 2288
You don't have to rewrite anything. You can reuse your core data loading logic inside a PyTorch Dataset:
import cv2, glob
import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader


class MyCoolDataset(Dataset):
    def __init__(self, dir, train=True):
        filelist = glob.glob(dir + '/*.png')
        ...
        # all your data loading logic using cv2, glob ..
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=4)

        # two modes - train and test
        if train:
            self.x_data, self.y_data = x_train, y_train
        else:
            self.x_data, self.y_data = x_test, y_test

    def __len__(self):
        # the default sampler needs the dataset size
        return len(self.x_data)

    def __getitem__(self, i):
        return self.x_data[i], self.y_data[i]
Then use a DataLoader as usual:
dl = DataLoader(MyCoolDataset(...), batch_size=...)
for X, Y in dl:
    pass
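Since you want separate training and testing loaders, you could instantiate the dataset twice with the train flag. A short sketch, reusing the folder name from your question and the batch size from the other answer as placeholders:
# one dataset per split, selected via the train flag
train_ds = MyCoolDataset('annotation_class', train=True)
test_ds = MyCoolDataset('annotation_class', train=False)

# shuffle only the training loader; keep the test loader deterministic
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=16)
Because random_state is fixed in train_test_split, both instances produce the same split, so the train and test sets stay disjoint.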
Upvotes: 3