Reputation: 71
I'm working on a color image classifier in Python using either an SVM or a KNN classifier, built with the TensorFlow library.
The problem is, for various reasons, I'm not allowed to organize my training data into separate directories. For instance, my two classes are "Apples" and "Oranges" but all the image data is in one directory called "training", so I can only sort them by file names, like "Apples1", "Orange1".
This makes it difficult to use methods like tf.keras.utils.image_dataset_from_directory, because they rely on directory names to determine the classes.
Is there a good TensorFlow method for what I'm doing? If not, is there a good way to load and preprocess color image data to work with the TensorFlow library?
Upvotes: 0
Views: 523
Reputation: 56
Warning: Haven't tested any of this in a complete training session. Use as a guideline rather than an exact solution.
Option 1: Build a tf.data.Dataset of your own.
There are many ways to do this; the easiest is to use a Python generator. For more info, check out these resources: 1, 2, 3.
Here's a minimal example using generators.
Assuming directory structure looks like this:
.
├── script.py
└── training
    ├── Apples1.jpg
    ├── Apples2.jpg
    ├── Apples3.jpg
    ├── Apples4.jpg
    ├── Oranges1.jpg
    ├── Oranges2.jpg
    ├── Oranges3.jpg
    └── Oranges4.jpg
The script.py contains the code below.
import os
import glob
import random

import tensorflow as tf

target_dir = os.path.realpath("./training")
file_paths = glob.glob(f"{target_dir}/*.jpg")
classes = ['apple', 'orange']
NUM_CLASSES = len(classes)
IMGW, IMGH, CH = 256, 256, 3

def data_generator(fpaths):
    # from_generator passes args as tensors, so string paths arrive as bytes
    fpaths = [fp.decode() if isinstance(fp, bytes) else fp for fp in fpaths]
    random.shuffle(fpaths)
    for fpath in fpaths:
        # Build a one-hot label from the class name embedded in the file name
        class_index = [int(c in os.path.basename(fpath).lower()) for c in classes]
        one_hot = tf.constant(class_index, dtype=tf.int32)
        image = tf.keras.utils.load_img(fpath, target_size=(IMGH, IMGW))
        input_arr = tf.keras.utils.img_to_array(image)
        # No batch dimension here; batching is handled by .batch() below
        yield input_arr, one_hot

train_generator = tf.data.Dataset.from_generator(
    data_generator,
    args=[file_paths],
    output_signature=(
        tf.TensorSpec(shape=(IMGH, IMGW, CH), dtype=tf.float32),
        tf.TensorSpec(shape=(NUM_CLASSES,), dtype=tf.int32),
    )
)

BATCH_SIZE = 64
train_dataset = train_generator.batch(BATCH_SIZE).prefetch(1)
Then train_dataset can be passed to tf.keras.Model.fit() for training. You can split file_paths into train_file_paths and val_file_paths to choose exactly which files you want to use for training vs. validation, and build a val_dataset using the same procedure.
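As a sketch of that split (plain Python, 80/20 ratio; the file names here are hypothetical stand-ins for the glob result above):

```python
import random

# Hypothetical stand-in for the glob result above
file_paths = [f"training/Apples{i}.jpg" for i in range(1, 5)] \
           + [f"training/Oranges{i}.jpg" for i in range(1, 5)]

random.seed(0)               # reproducible shuffle
random.shuffle(file_paths)   # shuffle before splitting so classes mix

split = int(0.8 * len(file_paths))   # 80% train, 20% validation
train_file_paths = file_paths[:split]
val_file_paths = file_paths[split:]

print(len(train_file_paths), len(val_file_paths))  # 6 2
```

Each list can then feed its own tf.data.Dataset.from_generator(...) call to produce train_dataset and val_dataset.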
Option 2: Nest the training directory inside another directory.
The method tf.keras.utils.image_dataset_from_directory is a bit funky in my opinion, but I think you can still use it if you provide a list of integers for the labels argument, corresponding to the sorted list of image file paths in your training directory (according to the documentation).
Here's a minimal example.
Assuming you have a directory structure like this:
.
├── images
│   └── training
│       ├── Apples1.jpg
│       ├── Apples2.jpg
│       ├── Apples3.jpg
│       ├── Apples4.jpg
│       ├── Oranges1.jpg
│       ├── Oranges2.jpg
│       ├── Oranges3.jpg
│       └── Oranges4.jpg
└── script.py
Then script.py, defined as below, will create a tf.data.Dataset object as usual.
Important: note the target_dir. It's not the training directory as we might expect, but one directory above it.
import os

import tensorflow as tf

target_dir = os.path.realpath("./images")

# os.walk yields tuples top-down, so after the loop `file_names` holds
# the files of the innermost directory visited ("training")
for _, _, file_names in os.walk(target_dir):
    pass
file_names = sorted(file_names)

label_map = {
    "apple": 0,
    "orange": 1,
}

# Assign an integer label to each file based on its name
labels = []
for fname in file_names:
    for k, v in label_map.items():
        if k in fname.lower():
            labels.append(v)
            print(f"Assigned label '{k}' to file '{fname}'")
            break

dataset = tf.keras.utils.image_dataset_from_directory(
    target_dir,
    labels=labels,
)
Upvotes: 1