Sebastian Di Ravello

Reputation: 71

Tensorflow: Get color image data based on file names rather than directories

I'm working on a color image classifier in Python, using either an SVM or a KNN classifier built with the TensorFlow library.

The problem is, for various reasons, I'm not allowed to organize my training data into separate directories. For instance, my two classes are "Apples" and "Oranges" but all the image data is in one directory called "training", so I can only sort them by file names, like "Apples1", "Orange1".

This makes it difficult to use methods like tf.keras.utils.image_dataset_from_directory, because they infer the classes from directory names.

Is there a good Tensorflow method for what I'm doing? If not, is there a good way to load and preprocess color image data to work with the Tensorflow library?

Upvotes: 0

Views: 523

Answers (1)

Salman Shah

Reputation: 56

Warning: Haven't tested any of this in a complete training session. Use as a guideline rather than an exact solution.

Option 1: Build your own tf.data.Dataset. There are many ways to do this; the easiest is to use a Python generator. For more info, see the TensorFlow documentation on tf.data and tf.data.Dataset.from_generator.

Here's a minimal example using generators.

Assuming directory structure looks like this:

.
├── script.py
└── training
    ├── Apples1.jpg
    ├── Apples2.jpg
    ├── Apples3.jpg
    ├── Apples4.jpg
    ├── Oranges1.jpg
    ├── Oranges2.jpg
    ├── Oranges3.jpg
    └── Oranges4.jpg

The script.py contains the code below.

import os
import glob
import random
import tensorflow as tf

target_dir = os.path.realpath("./training")
file_paths = glob.glob(f"{target_dir}/*.jpg")

classes = ['apple', 'orange']
  
IMGW, IMGH, CH = 256, 256, 3
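def data_generator(fpaths):
  # from_generator passes args as NumPy arrays of bytes, so decode each path
  fpaths = [p.decode("utf-8") for p in fpaths]
  random.shuffle(fpaths)  # visit the files in a new order on every pass
  for fpath in fpaths:
    fname = os.path.basename(fpath).lower()
    # one-hot label from the class name embedded in the file name
    one_hot = [float(c in fname) for c in classes]
    image = tf.keras.utils.load_img(fpath, target_size=(IMGH, IMGW))
    input_arr = tf.keras.utils.img_to_array(image)  # shape (IMGH, IMGW, CH)
    # no batch dim here: .batch() below adds it
    yield input_arr, one_hot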

NUM_CLASSES = len(classes)
train_generator = tf.data.Dataset.from_generator(
  data_generator,
  args=[file_paths],
  output_signature=(
    tf.TensorSpec(shape=(IMGH, IMGW, CH)),
    tf.TensorSpec(shape=(NUM_CLASSES,))
  )
)

BATCH_SIZE = 64
train_dataset = train_generator.batch(BATCH_SIZE).prefetch(1)

train_dataset can then be passed to tf.keras.Model.fit() for training. You can split file_paths into train_file_paths and val_file_paths to choose exactly which files are used for training and which for validation, and build a val_dataset with the same procedure.
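
The split described above could look like the sketch below. It is plain Python with no TensorFlow dependency; the helper name split_paths, the 80/20 ratio, the fixed seed and the example paths are assumptions for illustration only. Note that a purely random split is not guaranteed to keep the classes balanced.

```python
import random

def split_paths(paths, val_fraction=0.2, seed=42):
    """Shuffle a copy of paths reproducibly and split it into (train, val)."""
    shuffled = sorted(paths)  # sort first so the result does not depend on glob order
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Illustrative paths; in script.py this would be the file_paths list from glob
example_paths = (
    [f"training/Apples{i}.jpg" for i in range(1, 5)]
    + [f"training/Oranges{i}.jpg" for i in range(1, 5)]
)
train_file_paths, val_file_paths = split_paths(example_paths)
print(len(train_file_paths), len(val_file_paths))  # 7 1
```

Each list can then be handed to its own tf.data.Dataset.from_generator call through args=[...], as in the script above.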

Option 2: Nest the training directory into another directory.

The method tf.keras.utils.image_dataset_from_directory is a bit funky in my opinion, but I think you can still use it. According to the documentation, the labels argument accepts a list of integers, which must correspond to the sorted list of image file paths in your training directory.

Here's a minimal example.

Assuming you have a directory structure like this:

.
├── images
│   └── training
│       ├── Apples1.jpg
│       ├── Apples2.jpg
│       ├── Apples3.jpg
│       ├── Apples4.jpg
│       ├── Oranges1.jpg
│       ├── Oranges2.jpg
│       ├── Oranges3.jpg
│       └── Oranges4.jpg
└── script.py

Then script.py defined as below will create a tf.data.Dataset object like usual.

Important: Note the target_dir. It's not the training directory as we might expect, but one directory above.

import os
import tensorflow as tf

target_dir = os.path.realpath("./images")

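# List the images inside the nested "training" directory. The labels list
# must follow the alphanumeric order of the image file paths.
file_names = sorted(os.listdir(os.path.join(target_dir, "training")))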
  
label_map = { 
  "apple": 0,
  "orange": 1,
}
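labels = []
for fname in file_names:
  for k, v in label_map.items():
    if k in fname.lower():
      labels.append(v)
      print(f"Assigned label '{k}' to file '{fname}'")
      break
  else:
    # every file needs a label, otherwise labels and files fall out of step
    raise ValueError(f"No label found for file '{fname}'")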

dataset = tf.keras.utils.image_dataset_from_directory(
  target_dir,
  labels=labels,
)
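
Since the labels list has to line up one-to-one with the sorted image files, it may be worth sanity-checking it before building the dataset. The sketch below is plain Python with no TensorFlow dependency; the helper name labels_for and the sample file names are illustrative assumptions, not part of any library.

```python
from collections import Counter

def labels_for(file_names, label_map):
    """Return one integer label per file name, in sorted file-name order.

    Raises ValueError when a name matches no key or more than one key,
    so the labels list cannot silently drift out of step with the files.
    """
    labels = []
    for fname in sorted(file_names):
        matches = [v for k, v in label_map.items() if k in fname.lower()]
        if len(matches) != 1:
            raise ValueError(
                f"Expected exactly one class for '{fname}', got {len(matches)}"
            )
        labels.append(matches[0])
    return labels

# Illustrative file names mirroring the directory tree above
names = ["Oranges2.jpg", "Apples1.jpg", "Oranges1.jpg", "Apples2.jpg"]
labels = labels_for(names, {"apple": 0, "orange": 1})
print(labels)           # [0, 0, 1, 1]
print(Counter(labels))  # number of images per class
```

Substring matching is fragile (a file named "pineapple1.jpg" would match "apple"), so a stricter rule may be needed for real data.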

Upvotes: 1
