Reputation: 190
I work with many images (10M+) stored in a single directory (no subfolder per class) and use a pandas DataFrame to keep track of the class labels. The images do not fit in memory, so I must read minibatches from disk. So far I have used Keras .flow_from_directory(), but it requires me to move images into one subfolder per class (and per train/validation split). It works well, but it becomes very impractical when I want to use different subsets of images or define classes in various ways. Does anyone have an alternative strategy that uses a database (e.g. a pandas.DataFrame) to keep track of which minibatches to read, instead of moving images into subfolders?
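For context, here is a minimal sketch of the kind of bookkeeping I mean (the column names `id` and `label` are just for illustration):

    import pandas as pd

    # A DataFrame tracking image ids and labels, so subsets and class
    # definitions can be changed without moving any files on disk
    df = pd.DataFrame({
        "id": ["img_001", "img_002", "img_003", "img_004"],
        "label": ["cat", "dog", "cat", "bird"],
    })

    # Select an arbitrary subset of images by label, no directory shuffling needed
    subset_ids = df.loc[df["label"].isin(["cat", "dog"]), "id"].tolist()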
Upvotes: 0
Views: 734
Reputation: 8527
You need a custom data generator. For example (this assumes `batch_size`, `dpath`, `df_train` and `classes` are defined elsewhere):

    import numpy as np
    import cv2
    from keras.utils import to_categorical

    def batch_generator(ids):
        while True:
            for start in range(0, len(ids), batch_size):
                x_batch = []
                y_batch = []
                end = min(start + batch_size, len(ids))
                ids_batch = ids[start:end]
                for id in ids_batch:
                    img = cv2.imread(dpath + 'train/{}.jpg'.format(id))
                    #img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_AREA)
                    # Look up this id's label in the DataFrame ([0] extracts the scalar)
                    labelname = df_train.loc[df_train.id == id, 'column_name'].values[0]
                    labelnum = classes.index(labelname)
                    x_batch.append(img)
                    y_batch.append(labelnum)
                x_batch = np.array(x_batch, np.float32)
                y_batch = to_categorical(y_batch, 120)
                yield x_batch, y_batch
Then you can call the generator with just a numpy array of ids (or image names), like this:
    model.fit_generator(generator=batch_generator(ids_train_split),
                        steps_per_epoch=np.ceil(float(len(ids_train_split)) / float(batch_size)),
                        epochs=epochs, verbose=1, callbacks=callbacks,
                        validation_data=batch_generator(ids_valid_split),
                        validation_steps=np.ceil(float(len(ids_valid_split)) / float(batch_size)))
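If you don't already have the id arrays, one way to build `ids_train_split` and `ids_valid_split` from your full id list is a simple shuffled split (a sketch; the 80/20 ratio and the fixed seed are assumptions):

    import numpy as np

    # Hypothetical full id list; in practice this would come from your DataFrame
    ids = np.array(["img_{:03d}".format(i) for i in range(100)])

    # Shuffle once with a fixed seed for a reproducible split
    rng = np.random.RandomState(42)
    perm = rng.permutation(len(ids))
    n_valid = int(0.2 * len(ids))
    ids_valid_split = ids[perm[:n_valid]]
    ids_train_split = ids[perm[n_valid:]]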
Upvotes: 2