Reputation: 248
I am training Resnet50 on imagenet using the script provided from PyTorch (with a slight trivial tweak for my purpose). However, I am getting the following error after 14 epochs of training. I have allocated 4 gpus in the server I'm using to run this. Any pointers as to what this error is about would be appreciated. Thanks a lot!
Epoch: [14][5000/5005] Time 1.910 (2.018) Data 0.000 (0.191) Loss 2.6954 (2.7783) Total 2.6954 (2.7783) Reg 0.0000 Prec@1 42.969 (40.556) Prec@5 64.844 (65.368)
Test: [0/196] Time 86.722 (86.722) Loss 1.9551 (1.9551) Prec@1 51.562 (51.562) Prec@5 81.641 (81.641)
Traceback (most recent call last):
File "main_group.py", line 549, in <module>
File "main_group.py", line 256, in main
File "main_group.py", line 466, in validate
if args.gpu is not None:
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
return self._process_data(data)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
sample = self.loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
return pil_loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'
Upvotes: 0
Views: 301
Reputation: 863
It is difficult to tell what the problem is just by looking at the error you have posted.
All we know is that there was an issue reading the file at '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'
.
Try the following:
Upvotes: 1