Reputation: 127
I am working on a dataset of 300K images doing multi class image classification. So far i took a small dataset of around 7k images, but the code either returns memory error or my notebook just dies. The code below converts all images to a numpy array at once, which results in trouble with my memory when the last row of code gets executed. train.csv contains image-filenames and one hot encoded labels. The code is like this:
data = pd.read_csv('train.csv')
img_width = 400
img_height = 400
img_vectors = []
for i in range(data.shape[0]):
path = 'Images/' + data['Id'][
img = image.load_img(path, target_size=(img_width, img_height, 3))
img = image.img_to_array(img)
img = img/255.0
img_vectors.append(img)
img_vectors = np.array(img_vectors)
Error Message:
MemoryError Traceback (most recent call last)
<ipython-input-13-dd2302ae54e1> in <module>
----> 1 img_vectors = np.array(img_vectors)
MemoryError: Unable to allocate array with shape (7344, 400, 400, 3) and data type float32
I guess I need a batch of smaller arrays for all images to handle memory issue, to avoid having one array with all imagedata at the same time.
On an earlier project i did image-classification without multi-label with around 225k images. Anyway this code doesnt convert all image-data to one giant array. It rather puts the imagedata into smaller batches:
#image preparation
if K.image_data_format() is "channels_first":
input_shape = (3, img_width, img_height)
else:
input_shape = (img_width, img_height, 3)
train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(validation_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')
model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Dense(17))
model.add(BatchNormalization(axis=1, momentum=0.6))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit_generator(
train_generator,
steps_per_epoch=nb_train_samples // batch_size,
epochs=epochs,
validation_data=validation_generator,
validation_steps=nb_validation_samples // batch_size,
class_weight = class_weight
)
So what i actually need is an approach of how I can handle big datasets of images for multilabel image classification without getting in trouble with memory. Ideal would be to work with a csv-file containing image-filename and one-hot-encoded labels in combination with array batches for learning.
Any help or guesses here would be greatly appreciated.
Upvotes: 0
Views: 806
Reputation: 612
The easiest way to solve the problem you are facing is to write a costume data generator, here is a tutorial that shows how to do this. The idea is that instead of using flow_from_directory
, you create generate a costume dataloader, that reads each image from its source path and gives to y the correspongind labels. Practiclly I think that your data is stored on a .csv file, where each row contain the path to an image, and the labels present in the image. So your datagen will have a function getittem(self, index) that will read the image from the path in raw number index and return along with the target that is obtained by reading the labels in this raw and one hot encode them, then sum them.
Upvotes: 1