Reputation: 1212
Here is a minimal example that illustrates the issue.
This is the Dataset:
class IceShipDataset(Dataset):
    BAND1='band_1'
    BAND2='band_2'
    IMAGE='image'

    @staticmethod
    def get_band_img(sample,band):
        pic_size=75
        img=np.array(sample[band])
        img.resize(pic_size,pic_size)
        return img

    def __init__(self,data,transform=None):
        self.data=data
        self.transform=transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample=self.data[idx]
        band1_img=IceShipDataset.get_band_img(sample,self.BAND1)
        band2_img=IceShipDataset.get_band_img(sample,self.BAND2)
        img=np.stack([band1_img,band2_img],2)
        sample[self.IMAGE]=img
        if self.transform is not None:
            sample=self.transform(sample)
        return sample
And this is the code which fails:
PLAY_BATCH_SIZE=4
#load data. There are 1604 examples.
with open('train.json','r') as f:
    data=f.read()
data=json.loads(data)
ds=IceShipDataset(data)
playloader = torch.utils.data.DataLoader(ds, batch_size=PLAY_BATCH_SIZE,
                                         shuffle=False, num_workers=4)
for i,data in enumerate(playloader):
    print(i)
It fails with that weird "too many open files" error in the for loop. My torch version is 0.3.0.post4.
If you want the json file, it is available at Kaggle (https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)
I should mention that the error has nothing to do with the state of my laptop:
yoni@yoni-Lenovo-Z710:~$ lsof | wc -l
89114
yoni@yoni-Lenovo-Z710:~$ cat /proc/sys/fs/file-max
791958
What am I doing wrong here?
Upvotes: 18
Views: 12855
Reputation: 13088
I know how to fix the error, but I don't have a complete explanation for why it happens.
First, the solution: you need to make sure that the image data is stored as numpy.arrays. When you call json.loads, it loads them as Python lists of floats. This causes the torch.utils.data.DataLoader to individually transform each float in the list into a torch.DoubleTensor.
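To make the type issue concrete, here is a small sketch. The record below is made up for illustration (two values per band instead of the real 75*75), but it shows what json.loads hands back:

```python
import json

# Hypothetical miniature record; the real bands hold 75*75 floats each.
raw = '{"band_1": [-27.9, -27.2], "band_2": [-31.5, -30.1]}'
sample = json.loads(raw)

# json.loads produces plain Python containers, not numpy arrays.
band_type = type(sample["band_1"])
elem_type = type(sample["band_1"][0])
print(band_type.__name__, elem_type.__name__)  # prints: list float
```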
Have a look at default_collate in torch.utils.data.DataLoader: your __getitem__ returns a dict, which is a mapping, so default_collate gets called again on each element of the dict. The first couple are ints, but then you get to the image data, which is a list, i.e. a collections.Sequence. This is where things get funky, as default_collate is called on each element of the list. This is clearly not what you intended. I don't know what the assumption in torch is about the contents of a list versus a numpy.array, but given the error it would appear that that assumption is being violated.
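The recursion described above can be sketched with a toy model. This is not torch's actual default_collate, only a simplified stand-in that follows the same branching (mappings per key, sequences per position, everything else as a single leaf). FakeArray and count_leaf_batches are hypothetical names introduced for illustration:

```python
from collections.abc import Mapping, Sequence


class FakeArray:
    """Stand-in for a numpy array: handled as one leaf, not as a sequence."""

    def __init__(self, values):
        self.values = list(values)


def count_leaf_batches(batch):
    """Count how many separate leaf batches a recursive collate would build.

    Simplified model, NOT torch's code: mappings recurse per key,
    sequences recurse per position, anything else is one leaf batch.
    """
    first = batch[0]
    if isinstance(first, Mapping):
        return sum(count_leaf_batches([s[k] for s in batch]) for k in first)
    if isinstance(first, Sequence) and not isinstance(first, str):
        return sum(count_leaf_batches(list(group)) for group in zip(*batch))
    return 1


PIXELS = 75 * 75  # one band, as in the question
as_lists = [{"band_1": [0.0] * PIXELS} for _ in range(4)]
as_arrays = [{"band_1": FakeArray([0.0] * PIXELS)} for _ in range(4)]

list_count = count_leaf_batches(as_lists)
array_count = count_leaf_batches(as_arrays)
print(list_count, array_count)  # prints: 5625 1
```

In this model a batch of four list-valued bands produces one tiny leaf batch per pixel position, while an array-like band is collated once as a whole.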
The fix is pretty trivial: just make sure the two image bands are numpy.arrays, for instance in __init__:
def __init__(self,data,transform=None):
    self.data=[]
    for d in data:
        d[self.BAND1] = np.asarray(d[self.BAND1])
        d[self.BAND2] = np.asarray(d[self.BAND2])
        self.data.append(d)
    self.transform=transform
Or do the conversion right after you load the JSON. It doesn't really matter where you do it, as long as you do it.
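For the "after you load the JSON" variant, a minimal sketch could look like the following. load_samples and its band_keys parameter are hypothetical helpers, not part of the original code, and the input text is a made-up two-record stand-in for train.json:

```python
import json

import numpy as np


def load_samples(json_text, band_keys=("band_1", "band_2")):
    """Parse JSON text and convert the listed band fields to numpy arrays."""
    samples = json.loads(json_text)
    for sample in samples:
        for key in band_keys:
            sample[key] = np.asarray(sample[key], dtype=np.float64)
    return samples


# Hypothetical input with tiny bands, only for illustration.
text = (
    '[{"band_1": [1.0, 2.0], "band_2": [3.0, 4.0]},'
    ' {"band_1": [5.0, 6.0], "band_2": [7.0, 8.0]}]'
)
samples = load_samples(text)
print(type(samples[0]["band_1"]).__name__)
```

The resulting list can then be passed to IceShipDataset unchanged, since get_band_img already wraps each band with np.array.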
Why does the above result in "too many open files"?
I don't know, but as the comments pointed out, it is likely to do with interprocess communication and lock files on the two queues data is taken from and added to.
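As a side note, the lsof and file-max numbers in the question are system-wide. The "too many open files" message is usually reported when a single process reaches its own descriptor limit, which is a separate setting. A small sketch for inspecting that limit from Python, assuming a Unix-like system where the standard resource module is available:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors (Unix only).
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft:", soft_limit, "hard:", hard_limit)
```

This only shows the limit; it does not explain why the DataLoader workers use so many descriptors.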
Footnote: train.json was not available for download from Kaggle due to the competition still being open (??). I made a dummy JSON file that should have the same structure and tested the fix on that dummy file.
Upvotes: 7