syltruong

Reputation: 2723

PyTorch datasets: ImageFolder and subfolder filtering

I would like to use ImageFolder to create an Image Dataset.

My current image directory structure looks like this:

/root
-- train/
---- 001.jpg
---- 002.jpg
---- ....
-- test/
---- 001.jpg
---- 002.jpg
---- ....

I would like to have a dataset dedicated to training data, and a dataset dedicated to test data.

As I understand it, doing

dataset = ImageFolder(root='root/train')

does not find any images.

Doing

dataset = ImageFolder(root='root')

finds images, but the train and test images are mixed together in a single dataset.

ImageFolder has a loader argument, but I could not find a use case for it here.

How can I discriminate images in the root folder according to the subfolder they belong to?

Upvotes: 1

Views: 8643

Answers (2)

kuzand

Reputation: 9806

ImageFolder expects the data folder (the one that you pass as root) to contain subfolders representing the classes to which its images belong. Something like this:

data/
├── train/
|   ├── class_0/
|   |   ├── 001.jpg
|   |   ├── 002.jpg
|   |   └── 003.jpg
|   └── class_1/
|       ├── 004.jpg
|       └── 005.jpg
└── test/
    ├── class_0/
    |   ├── 006.jpg
    |   └── 007.jpg
    └── class_1/
        ├── 008.jpg
        └── 009.jpg

Having the above folder structure you can do the following:

train_dataset = ImageFolder(root='data/train')
test_dataset  = ImageFolder(root='data/test')

Since you don't have that structure, one obvious option is to create class subfolders and move the images into them. Another option is to write a custom Dataset that reads the images directly from your flat train/ and test/ folders.
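For the custom Dataset option, a minimal sketch might look like the following. It assumes the flat layout from the question (images directly inside train/ and test/) and that no labels are available, so each item is returned together with its file path instead of a class index. The names FlatImageFolder, pil_loader and IMG_EXTENSIONS are hypothetical and not part of torchvision; the loader parameter only mimics the idea of ImageFolder's loader argument.

```python
import os

from PIL import Image
from torch.utils.data import Dataset

# Assumption: which file extensions count as images
IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def pil_loader(path):
    # Open the file and force three channels (RGB)
    with open(path, 'rb') as f:
        return Image.open(f).convert('RGB')


class FlatImageFolder(Dataset):
    """Hypothetical helper: images sit directly in `root`, no class subfolders."""

    def __init__(self, root, transform=None, loader=pil_loader):
        self.root = root
        self.transform = transform
        self.loader = loader
        # Sort so that the index -> file mapping is stable across runs
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.lower().endswith(IMG_EXTENSIONS)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        image = self.loader(path)
        if self.transform is not None:
            image = self.transform(image)
        # No labels exist in this layout, so return the path instead
        return image, path
```

With this sketch, FlatImageFolder('root/train') and FlatImageFolder('root/test') would give two separate datasets. If the labels live elsewhere (for example in a CSV file), __getitem__ is the place to look them up and return them in place of the path.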

Upvotes: 3

Amit Sharma

Reputation: 727

I found that creating a subfolder for each class, separately for train/val/test, as ImageFolder expects, works very well. Here's a script that I wrote for my own use case; you can adapt it to yours:

import os
import shutil

data_dir = '/content/data/oxford-102-flowers/'
files = ['train.txt', 'test.txt', 'valid.txt']

for list_file in files:
  # 'train.txt' -> 'train'
  split = os.path.splitext(list_file)[0]
  with open(os.path.join(data_dir, list_file)) as myfile:
    for line in myfile:
      # Each line looks like: jpg/image_05038.jpg 58
      parts = line.split()
      if len(parts) < 2:
        continue  # skip blank or malformed lines
      src = os.path.join(data_dir, parts[0])

      # Create <data_dir>/<split>/<class label>/ if it does not exist yet
      sub_dir = os.path.join(data_dir, split, parts[1])
      os.makedirs(sub_dir, exist_ok=True)

      # Copy the image into its class subfolder
      shutil.copy(src, sub_dir)
print("All files copied to the subfolders")

I was working on the Oxford-102-Dataset and had three .txt files, one each for the train, validation and test sets. Each line of a .txt file contains the relative path of an image followed by its class label (e.g. jpg/image_05038.jpg 58, where jpg is the source folder in which all the images are stored and 58 is the ground-truth class).

Upvotes: 0
