comendeiro

Reputation: 836

Python generator with Keras underperforming

I was trying to do a simple image classification exercise using a CNN and Keras.

I have a list that stores the paths of the images (train_glob) and another list with the corresponding classification labels, one-hot encoded (dummy_y).

The function load_one() takes as arguments a path and some parameters for image resizing and augmentation and returns a transformed image as a numpy array.

When I run the code in batch mode through .fit(), building a single array called batch_features that holds all the images, I achieve a decent accuracy of 0.7 after 5 epochs.

The problem appears when I try to replicate the results using a Python generator to feed the data and train with .fit_generator(): the performance is really poor, when in fact I would have expected it to be slightly better since, to my understanding, more data is being fed.

Unlike the batch version, in the generator I randomly alter the brightness of the images and loop over the data more times, so in theory, if I understand correctly how the generator works, I would expect the results to be better.

This is my generator function:

import numpy as np

def generate_arrays_from_file(paths, cat_list, batch_size=128):
    number = 0                  # index of the current batch
    max_len = len(paths)
    while True:
        batch_features = np.zeros((batch_size, 128, 64, 3), np.uint8)
        batch_labels = np.zeros((batch_size, cat_list.shape[1]), np.uint8)
        for i in range(number * batch_size, number * batch_size + batch_size):
            # load one image, resize it to 64x128 and augment it
            batch_features[i % batch_size] = load_one(paths[i], final_size=(64, 128), augment=True)
            batch_labels[i % batch_size] = cat_list[i]
        batch_features = normalize_data(batch_features)
        yield batch_features, batch_labels
        number += 1
        # wrap around when the next batch would run past the end of the data;
        # note that a final partial batch is silently dropped
        if number * batch_size + batch_size > max_len:
            number = 0

And this is the Keras call to the generator:

mod.fit_generator(generate_arrays_from_file(train_glob, dummy_y, 256),
        samples_per_epoch=16368, nb_epoch=10)

Is this the right way of passing a generator?

Thanks

Upvotes: 0

Views: 1263

Answers (1)

Vlad V

Reputation: 1654

To match the accuracy, you need to feed in the same data. Since the generator applies transformations to the images that the batch version did not, it is normal for your accuracies not to match.

If you think the generator is the problem, you can test it out quite easily.

Fire up a Python shell, import your module, create a generator, and pull a few samples to see if they're what you expect.

# say you save the generator in mygenerator.py

$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mygenerator

# initialise paths, cat_list here:
>>> paths = [...]
>>> cat_list = [...]

# use a small batch_size to be able to see the results 
>>> g = mygenerator.generate_arrays_from_file(paths, cat_list, batch_size = 2)
>>> batch = g.__next__()
# now check if batch is what you expect
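If you prefer an automated version of the same check, here is a small sketch (assuming paths and cat_list are set up as above, with cat_list a numpy array of one-hot labels as in the question):

# hypothetical sanity check over a few batches; assumes paths and
# cat_list are defined as above
import numpy as np
from mygenerator import generate_arrays_from_file

g = generate_arrays_from_file(paths, cat_list, batch_size=2)
for _ in range(3):
    x, y = next(g)
    assert x.shape == (2, 128, 64, 3)         # (batch, height, width, channels)
    assert y.shape == (2, cat_list.shape[1])  # one-hot labels
    print(x.min(), x.max())                   # value range after normalize_data
    print(np.argmax(y, axis=1))               # label indices to compare against the images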

To save an image or display it (from this tutorial):

# Save:
from scipy import misc
misc.imsave('face.png', image_array) # uses the Image module (PIL)

# Display:
import matplotlib.pyplot as plt
plt.imshow(image_array)
plt.show()
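Note that plt.imshow interprets float RGB arrays as values in [0, 1] and uint8 arrays as values in [0, 255], so if you display a batch after normalize_data, make sure the values are in the range imshow expects.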

More about accuracy and data augmentation

  • If you test the two models (one trained with the generator and one with the whole data preloaded) on different datasets, the accuracies will clearly be different. Try to use the exact same train and test data for both models and turn off augmentation completely; you should then see similar accuracies (for the same number of epochs, batch sizes, etc.). If you don't, use the method above to fix the generator.

  • If there are only a few data points, the model will overfit very quickly (and thus have a high training accuracy). Data augmentation helps reduce overfitting and makes models generalise better. It also means that the training accuracy after very few epochs will be lower, as the data is more varied.

  • Please note it is very easy to get image processing (data augmentation) wrong without realising it. Crop incorrectly and you get a black image; zoom too much and you get only noise; confuse x and y and you get a totally wrong image; and so on. Test your generator to see that the images it outputs are what you expect and that the labels match.

  • On brightness. If you alter the brightness of the input images, you make your model agnostic to brightness; you do not improve generalisation to other things like rotations and zoom. Make sure you do not overdo the brightness changes: do not make your images fully white or fully black. If this happens, it would explain the huge drop in accuracy (see the bounded-jitter sketch after this list).

  • As pointed out in the comments by VMRuiz, if you have categorical data (which you do), use keras.preprocessing.image.ImageDataGenerator (docs). It will save you a lot of time. There is a very good example on the Keras blog (code here). If you are interested in your own image processing, have a look at the ImageDataGenerator source code. A minimal sketch of that route follows below.
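On the brightness point, one bounded way to jitter it is to scale by a random factor close to 1 and clip; jitter_brightness below is a hypothetical helper, not code from the question:

import numpy as np

def jitter_brightness(img, low=0.8, high=1.2):
    # scale pixel values by a random factor near 1 and clip to the valid
    # range, so the augmented image can never become fully black or white
    factor = np.random.uniform(low, high)
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

And a minimal sketch of the ImageDataGenerator route, with argument names following the Keras 1 API used in the question (X is assumed to be the preloaded image array from your batch version):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1. / 255,        # replaces the manual normalize_data step
    horizontal_flip=True,    # example augmentations; pick what suits your data
    width_shift_range=0.1,
    height_shift_range=0.1)

mod.fit_generator(datagen.flow(X, dummy_y, batch_size=256),
                  samples_per_epoch=len(X), nb_epoch=10)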

Upvotes: 1
