Reputation: 34
I am trying to use a CNN to classify some image data. I have around 38,000 images of varying sizes (roughly 400x400 px). I was originally only using a subset of the images and loading them into a list with OpenCV, but now that I tried to use all of the images, my RAM ran out. What is the correct way to deal with larger amounts of data in the training process? Can I load and train them in batches? If so, how?
I am working in a Python Jupyter Notebook.
Upvotes: 1
Views: 1445
Reputation: 8102
For large data sets the data must be read into the model in batches rather than all at once, since loading everything will cause an OOM (out of memory) error. Since you are working with images I recommend using ImageDataGenerator().flow_from_directory(). Documentation is [here][1]. To use it you need to arrange your images into directories and subdirectories.

For example, assume you have a data set of dog images and cat images and you want to build a classifier to predict whether an image is a dog or a cat. Create a directory called train. Within the train directory create a subdirectory called cats and a subdirectory called dogs, then place the cat images in the cats directory and the dog images in the dogs directory. I usually also set aside some of the images for testing, so I also create a directory called test. Within it create two subdirectories, cats and dogs, named identically to those in the train directory, and place your test images in the cats and dogs directories.
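For illustration (the directory and file names are just the example used above), the resulting layout would look like this:

train/
    cats/
        cat001.jpg, cat002.jpg, ...
    dogs/
        dog001.jpg, dog002.jpg, ...
test/
    cats/
    dogs/

Then use the code below to load in the data.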
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_dir = r'c:\train'
test_dir = r'c:\test'
img_height = 400
img_width = 400
batch_size = 32
epochs = 20

# training generator: 80% of the images in train_dir, rescaled to [0, 1]
train_gen = ImageDataGenerator(rescale=1/255, validation_split=.2).flow_from_directory(
    train_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size, seed=123,
    class_mode='categorical', subset='training',
    shuffle=True)

# validation generator: the remaining 20% of the images in train_dir
valid_gen = ImageDataGenerator(rescale=1/255, validation_split=.2).flow_from_directory(
    train_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size, seed=123,
    class_mode='categorical', subset='validation',
    shuffle=False)

# test generator: reads from the separate test directory
test_gen = ImageDataGenerator(rescale=1/255).flow_from_directory(
    test_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=False)
Then build and compile your model, using categorical_crossentropy as the loss. As a minimal sketch (the specific architecture below is just an illustration, not something prescribed here), the model could look like this:
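from tensorflow.keras import layers, models

model = models.Sequential([
    # small convolutional stack; adjust depth and filters to your data
    layers.Conv2D(32, 3, activation='relu', input_shape=(img_height, img_width, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(train_gen.num_classes, activation='softmax')  # one unit per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Then fit the model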
history=model.fit(x=train_gen, epochs=epochs, verbose=1, validation_data=valid_gen)
This setup creates validation data so you can monitor the model's performance during training. When training is complete, you can evaluate your model on the test set with
accuracy = model.evaluate(test_gen, batch_size=batch_size, verbose=1, steps=None)[1] * 100
print('Model accuracy on the test set is', accuracy)
[1]: https://keras.io/api/preprocessing/image/
Upvotes: 2
Reputation: 31
You will have to train on batches of data, loading them from the hard drive as you go. The tf.data.Dataset API automates this for you, check it out: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
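For example, here is a minimal sketch using tf.keras.utils.image_dataset_from_directory (the paths, image size and batch size are assumptions carried over from the question and the other answer; in older TensorFlow versions the function lives under tf.keras.preprocessing instead):

import tensorflow as tf

# builds a tf.data.Dataset that reads images from disk in batches
# (directory layout: train/<class_name>/*.jpg, as in the other answer)
train_ds = tf.keras.utils.image_dataset_from_directory(
    r'c:\train',
    validation_split=0.2,
    subset='training',
    seed=123,
    image_size=(400, 400),
    batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    r'c:\train',
    validation_split=0.2,
    subset='validation',
    seed=123,
    image_size=(400, 400),
    batch_size=32)

# optional: prefetch so disk reads overlap with training
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)

# the datasets can then be passed straight to model.fit, e.g.
# model.fit(train_ds, validation_data=val_ds, epochs=20)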
Upvotes: 0