Rio1210

Reputation: 248

Getting an error while training ResNet-50 on ImageNet at the 14th epoch

I am training ResNet-50 on ImageNet using the example script provided by PyTorch, with a trivial tweak for my purpose. After 14 epochs of training, the run fails with the error below, at the start of validation. I have allocated 4 GPUs on the server I'm using to run this. Any pointers as to what this error means would be appreciated. Thanks a lot!

Epoch: [14][5000/5005]  Time 1.910 (2.018)  Data 0.000 (0.191)  Loss 2.6954 (2.7783)    Total 2.6954 (2.7783)   Reg 0.0000  Prec@1 42.969 (40.556)  Prec@5 64.844 (65.368)   
Test: [0/196]   Time 86.722 (86.722)    Loss 1.9551 (1.9551)    Prec@1 51.562 (51.562)  Prec@5 81.641 (81.641)
Traceback (most recent call last):
  File "main_group.py", line 549, in <module>
  File "main_group.py", line 256, in main
    
  File "main_group.py", line 466, in validate
    if args.gpu is not None:
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
    return self._process_data(data)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
    sample = self.loader(path)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
    return pil_loader(path)
  File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
    with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'

Upvotes: 0

Views: 301

Answers (1)

Jabrove

Reputation: 863

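It is difficult to tell what the underlying problem is just by looking at the error you have posted.

All we know is that a DataLoader worker hit an issue reading the file at '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'. The innermost frame of the traceback is the open(path, 'rb') call in pil_loader, and Errno 5 is the operating system's generic input/output error (EIO), so the failure appears to have happened while accessing the file on storage, before PIL tried to decode anything.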

Try the following:

  1. Confirm the file actually exists.
  2. Confirm that it is in fact a valid JPEG and not corrupted (by viewing it).
  3. Confirm that you can open it with Python and also load it manually with PIL.
  4. If none of that works, try deleting the file. Do you get the same error on another file in the folder?
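
The checks above can be scripted. This is only a minimal sketch, not part of the PyTorch script: check_image and scan are names I made up for illustration, and the Pillow step is skipped if Pillow is not importable. It tests that the file exists, reads every byte with plain Python so that storage errors like the one in your traceback surface, and then decodes the image with PIL the same way torchvision's pil_loader does (open, then convert to RGB).

```python
import os

try:
    from PIL import Image  # Pillow, the library behind torchvision's pil_loader
except ImportError:
    Image = None  # assumption: without Pillow, only the raw read check runs


def check_image(path):
    """Return 'ok', or a short description of the first check that failed."""
    # Step 1: does the file exist at all?
    if not os.path.isfile(path):
        return "missing"
    # Step 3a: can plain Python read every byte? An EIO would show up here.
    try:
        with open(path, "rb") as f:
            f.read()
    except OSError as exc:
        return f"unreadable: {exc}"
    # Steps 2 and 3b: does PIL decode it as an image?
    if Image is not None:
        try:
            with Image.open(path) as img:
                img.convert("RGB")  # forces a full decode
        except Exception as exc:
            return f"undecodable: {exc}"
    return "ok"


def scan(root):
    """Step 4: check every file under root and collect the ones that fail."""
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            status = check_image(path)
            if status != "ok":
                bad.append((path, status))
    return bad


# Example (run against your own validation folder):
# for path, status in scan("/data/users2/oiler/github/imagenet-data/val"):
#     print(path, status)
```

If check_image reports the file as unreadable while its neighbours are fine, the file or the disk block under it is probably damaged. If many files fail, or the failures come and go between runs, I would look at the storage itself (for example a network mount) before deleting anything.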

Upvotes: 1
