Toddwf

Reputation: 31

Resized copy of Pytorch Tensor/Dataset

I have a homemade dataset with a few million rows. I am trying to make truncated copies, so I slice the tensors that I used to build the original dataset and create a new dataset from them. However, when I save the new dataset, which is only 20K rows, it is the same size on disk as the original dataset. Otherwise everything seems kosher, including, when I check, the sizes of the new tensors. What am I doing wrong?

import torch
import torch.utils.data as D

# original dataset - 2+ million rows
dataset = D.TensorDataset(training_data, labels)
torch.save(dataset, filename)

# 20k dataset for experiments
d = torch.Tensor(training_data[0:20000])
l = torch.Tensor(labels[0:20000])
ds_small = D.TensorDataset(d, l)
# this is the same size as the one above on disk... approx 1.45GB
torch.save(ds_small, filename_small)

Thanks

Upvotes: 3

Views: 1514

Answers (1)

McLawrence

Reputation: 5245

In your code d and training_data share the same memory, even though you use slicing during the creation of d: slicing a tensor returns a view onto the same underlying storage, and torch.save serializes that entire storage, not just the slice. The fix is to make an independent copy with clone():

d = training_data[0:20000].clone()
l = labels[0:20000].clone()

clone() will give you tensors with memory independent of the old ones, and the file size will be much smaller.

Note that using torch.Tensor() is not necessary when creating d and l since training_data and labels are already tensors.
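
To make the memory sharing visible, here is a minimal, self-contained sketch; the shapes, row counts, and file names are made up for illustration:

import os
import torch
import torch.utils.data as D

# Stand-ins for training_data and labels (assumed shapes)
training_data = torch.randn(2000000, 8)
labels = torch.randint(0, 2, (2000000,))

view = training_data[0:20000]          # slicing returns a view
copy = training_data[0:20000].clone()  # clone() makes an independent copy

print(view.data_ptr() == training_data.data_ptr())  # True: shared memory
print(copy.data_ptr() == training_data.data_ptr())  # False: own memory

torch.save(D.TensorDataset(view, labels[0:20000]), "small_view.pt")
torch.save(D.TensorDataset(copy, labels[0:20000].clone()), "small_clone.pt")

# The view-backed file is roughly as large as the full dataset;
# the cloned one holds only the 20k rows.
print(os.path.getsize("small_view.pt"), os.path.getsize("small_clone.pt"))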

Upvotes: 1
