Reputation: 363
I am trying to load data using PyTorch's Dataset and DataLoader classes. I use torch.from_numpy
to convert each array to a tensor in the Dataset, and from inspecting the data, each X and y is indeed a tensor:
# At this point dataset is {'X': numpy array of arrays, 'y': numpy array of arrays}
class TorchDataset(torch.utils.data.Dataset):
    def __init__(self, dataset):
        self.X_train = torch.from_numpy(dataset['X'])
        self.y_train = torch.from_numpy(dataset['y'])

    def __len__(self):
        return len(self.X_train)

    def __getitem__(self, index):
        return {'X': self.X_train[index], 'y': self.y_train[index]}

torch_dataset = TorchDataset(dataset)
dataloader = DataLoader(torch_dataset, batch_size=4, shuffle=True, num_workers=4)
for epoch in range(num_epochs):
    for X, y in enumerate(dataloader):
        features = Variable(X)
        labels = Variable(y)
        ....
However, on features = Variable(X)
I get:
RuntimeError: Variable data has to be a tensor, but got int
An example of an X and y in the dataset are:
In [1]: torch_dataset[1]
Out[1]:
{'X':
-2.5908 -3.1123 -2.9460 ... -3.9898 -4.0000 -3.9975
-3.0867 -2.9992 -2.5254 ... -4.0000 -4.0000 -4.0000
-2.7665 -2.5318 -2.7035 ... -4.0000 -4.0000 -4.0000
... ⋱ ...
-2.4784 -2.6061 -1.6280 ... -4.0000 -4.0000 -4.0000
-2.2046 -2.1778 -1.5626 ... -3.9597 -3.9366 -3.9497
-1.9623 -1.9468 -1.5352 ... -3.8485 -3.8474 -3.8474
[torch.DoubleTensor of size 1024x1024], 'y':
107
[torch.LongTensor of size 1]}
which is why it is very confusing to me that torch thinks X is an int. Any help would be much appreciated - thanks!
Upvotes: 0
Views: 4232
Reputation: 28349
The error comes from your use of enumerate:
the first value it returns
is the batch index (an int), not the actual data. There are two ways to make your script work.
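To see why X ends up being an int, here is a minimal sketch (using a plain list as a stand-in for the DataLoader, since enumerate behaves the same either way):

```python
# Stand-in for a DataLoader: each element is one batch (a dict,
# matching the original __getitem__).
batches = [{'X': [1.0, 2.0], 'y': [0]}, {'X': [3.0, 4.0], 'y': [1]}]

# Unpacking enumerate into (X, y) binds X to the running batch index
# (an int) and y to the actual batch - exactly the bug in the question.
first_values = []
for X, y in enumerate(batches):
    first_values.append(X)

print(first_values)  # [0, 1]
```

So Variable(X) is handed an int, which produces the RuntimeError above.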
First, since your X
and y
do not need any special processing, you can just return a sample of X
and y
as a tuple. Change your __getitem__
method to:
def __getitem__(self, index):
    return self.X_train[index], self.y_train[index]
Also, change your training loop a little bit:
for epoch in range(num_epochs):
    for batch_id, (x, y) in enumerate(dataloader):
        x = Variable(x)
        y = Variable(y)
        # then do whatever you want to do
Second, you can keep returning a dict from the __getitem__
method and extract the actual data in the training loop. In this case, you do not need to change the __getitem__
method. Just change your training loop:
for epoch in range(num_epochs):
    for batch_id, data in enumerate(dataloader):
        # data will be a dict
        x = Variable(data['X'])
        y = Variable(data['y'])
        # then do whatever you want to do
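Putting the second option together, here is a minimal end-to-end sketch with synthetic data (the shapes and label range are arbitrary, just for illustration). Note that on PyTorch >= 0.4 tensors can be used directly and wrapping in Variable is no longer necessary:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TorchDataset(Dataset):
    def __init__(self, dataset):
        self.X_train = torch.from_numpy(dataset['X'])
        self.y_train = torch.from_numpy(dataset['y'])

    def __len__(self):
        return len(self.X_train)

    def __getitem__(self, index):
        return {'X': self.X_train[index], 'y': self.y_train[index]}

# Synthetic stand-in data: 8 samples of 16x16 features, integer labels.
dataset = {'X': np.random.randn(8, 16, 16),
           'y': np.random.randint(0, 10, size=(8,))}
dataloader = DataLoader(TorchDataset(dataset), batch_size=4, shuffle=True)

for batch_id, data in enumerate(dataloader):
    x = data['X']  # a tensor batch, not an int
    y = data['y']
    print(batch_id, x.shape, y.shape)
```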
Upvotes: 2
Reputation: 1297
Notice that you are using enumerate in the for-loop. So what your loop actually receives are (index, batch) pairs:
for batch_index, batch in enumerate(dataloader):
Upvotes: 1