How to work with big dataset for multi-label image classification in terms of memory and batches

Question

I am working on a dataset of 300K images doing multi class image classification. So far i took a small dataset of around 7k images, but the code either returns memory error or my notebook just dies. The code below converts all images to a numpy array at once, which results in trouble with my memory when the last row of code gets executed. train.csv contains image-filenames and one hot encoded labels. The code is like this:

data = pd.read_csv('train.csv')

img_width = 400
img_height = 400

img_vectors = []

for i in range(data.shape[0]):
    path = 'Images/' + data['Id'][
    img = image.load_img(path, target_size=(img_width, img_height, 3))
    img = image.img_to_array(img)
    img = img/255.0
    img_vectors.append(img)

img_vectors = np.array(img_vectors)

Error Message:

MemoryError                               Traceback (most recent call last)
 in 
----> 1 img_vectors = np.array(img_vectors)

MemoryError: Unable to allocate array with shape (7344, 400, 400, 3) and data type float32

I guess I need a batch of smaller arrays for all images to handle memory issue, to avoid having one array with all imagedata at the same time.

On an earlier project i did image-classification without multi-label with around 225k images. Anyway this code doesnt convert all image-data to one giant array. It rather puts the imagedata into smaller batches:

#image preparation
if K.image_data_format() is "channels_first":
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(validation_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')

model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Dense(17))
model.add(BatchNormalization(axis=1, momentum=0.6))
model.add(Activation('softmax'))

model.summary()    

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    class_weight = class_weight
)

So what i actually need is an approach of how I can handle big datasets of images for multilabel image classification without getting in trouble with memory. Ideal would be to work with a csv-file containing image-filename and one-hot-encoded labels in combination with array batches for learning.

Any help or guesses here would be greatly appreciated.

How to work with big dataset for multi-label image classification in terms of memory and batches

Answers (1)

Related Questions