Matt Ward

Reputation: 85

Is Keras fit_generator the best thing to use when handling data that does not fit in RAM?

I am working to build a classifier that can classify knots. Currently I have a dataset that contains 100,000 images of the "unknot", 100,000 of the "plus trefoil", and 100,000 of the "minus trefoil".

I have been trying to get a classifier working on this large dataset for the last four days or so. The problems I have run into so far are:

1) The dataset does not fit in CPU main memory: I fixed this by making a number of EArrays with PyTables and HDF5 and appending them on disk, so now I have one 1.2 GB file which is the dataset.
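For reference, the way I build that file is roughly this (a simplified sketch; chunk_files and the way the small arrays are stored on disk are placeholders):

import numpy as np
import tables

hdf5_path = "300K_Knot_data.hdf5"
with tables.open_file(hdf5_path, mode='w') as f:
    train_data = f.create_earray(f.root, 'train_data',
                                 atom=tables.Float32Atom(),
                                 shape=(0, 128 * 128 * 3))
    train_label = f.create_earray(f.root, 'train_label',
                                  atom=tables.Float32Atom(),
                                  shape=(0, 3))
    # chunk_files: list of (data_path, label_path) pairs for the small numpy arrays (placeholder)
    for data_path, label_path in chunk_files:
        train_data.append(np.load(data_path))    # shape (n, 128*128*3)
        train_label.append(np.load(label_path))  # shape (n, 3), one-hot labels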

2) Even a very simple neural network in Keras was hitting 100% GPU (NVIDIA K80) memory usage after the model compiled, before I had even fit the model. I read that this was due to the Keras backend automatically allocating nearly 100% of available GPU memory on compile; I fixed this as well.

3) Once problems 1 and 2 were fixed, I was still getting strange accuracy values from Keras fit_generator().

Questions:

1) Is the method I have described of merging small numpy arrays into one large EArray using PyTables a good way to build a very large dataset (300,000 images of size 128x128, total size = 1.2 GB)?

2) Should fit_generator be used over train_on_batch in Keras? Will they return significantly different final loss/accuracy scores?

3) What is wrong with my generator method if I want to train the neural network on batches of 50 images from the HDF5 file and, after each training epoch, drop the images the network just trained on from main memory?

import numpy as np
import tables
import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D, Flatten, Dense, Dropout

hdf5_path = "300K_Knot_data.hdf5"
extendable_hdf5_file = tables.open_file(hdf5_path, mode='r')

def imageLoader(files, batch_size):

    L = len(files.root.train_data)

    #this line is just to make the generator infinite, keras needs that    
    while True:

        batch_start = 0
        batch_end = batch_size

        while batch_start < L:
            limit = min(batch_end, L)
            X = files.root.train_data[batch_start:limit]
            X = np.reshape(X, (X.shape[0], 128, 128, 3))
            X = X / 255.0  # scale pixel values to [0, 1]
            Y = files.root.train_label[batch_start:limit]

            yield (X,Y) #a tuple with two numpy arrays with batch_size samples     

            batch_start += batch_size   
            batch_end += batch_size

img_rows, img_cols = 128,128
###################################
# TensorFlow wizardry
config = tf.ConfigProto()

# Don't pre-allocate memory; allocate as-needed
config.gpu_options.allow_growth = True

# Only allow a total of half the GPU memory to be allocated
config.gpu_options.per_process_gpu_memory_fraction = 0.5

# Create a session with the above options specified.
K.tensorflow_backend.set_session(tf.Session(config=config))
###################################

model = Sequential()
model.add(Conv2D(64, (3,3), input_shape=(img_rows,img_cols,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(96, (3,3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(128, (3,3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))


model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(3))
model.add(Activation('softmax', name='preds'))


#lr was originally 0.01

model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0),
              metrics=['accuracy'])

# fit the model on batches drawn from the HDF5 generator

model.fit_generator(imageLoader(extendable_hdf5_file, 50),
                steps_per_epoch=240000 // 50, epochs=50)

Here is the output of fitting the model:

[Screenshot of the training log: accuracy stays at 100% for roughly the first 3000 steps and then drops to about 33%.]

Sorry if this is not the correct place for the post. My research adviser is out of town, and I have spent a considerable amount of time trying to fix this issue; I just need some input, because I cannot find an adequate solution online, or cannot wrap my head around what I do find well enough to implement it properly. I am not a new programmer, but I am relatively inexperienced in Python.

Upvotes: 4

Views: 3401

Answers (1)

today

Reputation: 33470

I think the way you have defined your generator is fine; nothing seems to be wrong with it. That said, using the Sequence class is currently the recommended approach, especially because it is much safer when doing multi-processing. However, using a generator is fine and it is still heavily used.
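As an illustration, a minimal Sequence-based version of your loader might look like this (a sketch, assuming the same node names and image shape as in your generator):

import numpy as np
from keras.utils import Sequence

class HDF5Sequence(Sequence):
    # Reads batches directly from the PyTables arrays
    def __init__(self, h5_file, batch_size):
        self.data = h5_file.root.train_data
        self.labels = h5_file.root.train_label
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.data) / float(self.batch_size)))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        X = self.data[start:end]
        X = np.reshape(X, (X.shape[0], 128, 128, 3)) / 255.0
        Y = self.labels[start:end]
        return X, Y

# usage: model.fit_generator(HDF5Sequence(extendable_hdf5_file, 50), epochs=50)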

As for the weird accuracy numbers: you mentioned that the accuracy is 100% for about 3000 steps and then drops to 33%. I have the following recommendations to diagnose this:

1) Decrease the learning rate, for example to 3e-3, 1e-3 or 1e-4. I recommend using an adaptive optimizer like Adagrad, RMSprop or Adam; specifically, RMSprop with the default parameters. Don't change any parameters at first; instead, experiment and make changes according to the feedback you get (if the loss decreases very slowly, increase the learning rate a bit; if it increases or is stable, decrease the learning rate, though these are not definite rules; you must experiment and take the validation loss into account as well). Using an adaptive optimizer reduces the need for a callback like ReduceLROnPlateau (at least until you have solid reasons to use it).
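For example, switching your compile call to RMSprop with the Keras defaults would just be (a sketch):

from keras.optimizers import RMSprop

# RMSprop with its default learning rate (0.001); tune only after observing the loss
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])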

2) If it is possible for you, split your whole data into train/validation/test sets (or at least train/validation). Ratios of 60/20/20 or 70/15/15 are the most commonly used. Make sure that the classes in each of those sets are equally represented (i.e. you have more or less the same number of "unknot", "plus trefoil" and "minus trefoil" samples in each set). Note that their distributions should be more or less the same as well; for example, you should not handpick the easy samples for the validation and test sets. Usually, selecting after shuffling the whole dataset (to make sure the samples are not in any particular order) works. Having a validation set helps you make sure that the progress you see during training is real and that the model is not over-fitting the training data.

You need to define a new generator for the validation data and pass the validation data (i.e. load the file which contains it) the same way you have done with the training data. Then set the validation_data and validation_steps arguments of the fit_generator method to appropriate values, as in the sketch below.
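For example (a sketch; the validation file name and its size are assumptions, and it is assumed to use the same node names as the training file so your imageLoader can be reused):

val_hdf5_file = tables.open_file("validation_knot_data.hdf5", mode='r')  # hypothetical validation file

model.fit_generator(imageLoader(extendable_hdf5_file, 50),
                    steps_per_epoch=240000 // 50,
                    epochs=50,
                    validation_data=imageLoader(val_hdf5_file, 50),
                    validation_steps=60000 // 50)  # adjust to your validation set size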

3) Whether to use batch normalization before or after the activation is still debated. Some people claim that it works better if you put it after the activation, i.e. exactly right before the next layer. You can experiment with this as well.
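For example, your first convolutional block with batch normalization after the activation would look like this (a sketch):

model.add(Conv2D(64, (3, 3), input_shape=(img_rows, img_cols, 3)))
model.add(Activation('relu'))
model.add(BatchNormalization())  # moved after the activation
model.add(MaxPooling2D(pool_size=(2, 2)))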

4) If you are not using your GPU for anything else, use its full capacity (i.e. don't limit its memory usage). And use a batch size that is a power of two (e.g. 64, 128, 256), since it helps with GPU memory allocation and may speed up training. 128 or 256 seems a good choice considering the number of samples you have.
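For example, a sketch of the corresponding changes to your setup:

# keep allow_growth if you like, but drop the 50% memory cap
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.tensorflow_backend.set_session(tf.Session(config=config))

batch_size = 128
model.fit_generator(imageLoader(extendable_hdf5_file, batch_size),
                    steps_per_epoch=240000 // batch_size, epochs=50)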

5) If training for one epoch takes a long time, consider using the ModelCheckpoint and EarlyStopping callbacks.
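For example (a sketch; the checkpoint file name is just a placeholder, and the validation generator is the one from point 2):

from keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    # save the best model seen so far (by validation loss) after each epoch
    ModelCheckpoint('knot_model_best.h5', monitor='val_loss', save_best_only=True),
    # stop if the validation loss has not improved for 5 epochs
    EarlyStopping(monitor='val_loss', patience=5)
]

model.fit_generator(imageLoader(extendable_hdf5_file, 50),
                    steps_per_epoch=240000 // 50,
                    epochs=50,
                    validation_data=imageLoader(val_hdf5_file, 50),
                    validation_steps=60000 // 50,
                    callbacks=callbacks)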

Upvotes: 4
