Jsevillamol

Reputation: 2543

Loading batches of images in Keras from pandas dataframe

I have a pandas dataframe with two columns: one with paths to images and the other with string class labels.

I have also written the following function, which loads the images listed in the dataframe, renormalizes them and converts the class labels to one-hot vectors.

def prepare_data(df):
    data_X, data_y = df.values[:,0], df.values[:,1]

    # Load images
    data_X = np.array([np.array(imread(fname)) for fname in data_X])

    # Normalize input
    data_X = data_X / 255 - 0.5

    # Prepare labels
    data_y = np.array([label2int[label] for label in data_y])
    data_y = to_categorical(data_y)

    return data_X, data_y

I want to feed this dataframe to a Keras CNN, but the whole dataset is too big to be loaded in memory at once.

Other answers on this site tell me that I should use a Keras ImageDataGenerator for that purpose, but honestly I do not understand how to do it from the documentation.

What is the easiest way of feeding the data to the model in lazily loaded batches?

If it is an ImageDataGenerator, how do I create one that takes the dataframe on initialization and passes the batches through my function to create the appropriate numpy arrays? And how do I fit the model using the ImageDataGenerator?

Upvotes: 3

Views: 6392

Answers (2)

sdcbr

Reputation: 7129

ImageDataGenerator is a high-level class that can yield data from multiple sources (np arrays, directories, ...) and that includes utility functions for image augmentation, among other things.

UPDATE

As of keras-preprocessing 1.0.4, ImageDataGenerator comes with a flow_from_dataframe method which addresses your case. It requires dataframe and directory arguments defined as follows:

dataframe: Pandas dataframe containing the filenames of the
           images in a column and classes in another or column/s
           that can be fed as raw target data.
directory: string, path to the target directory that contains all
           the images mapped in the dataframe.

So there is no longer any need to implement it yourself.
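If you want the generator to apply the same scaling as the prepare_data function in the question (pixel values mapped to [-0.5, 0.5]), one option is the preprocessing_function argument of ImageDataGenerator. Below is a minimal sketch rather than a definitive implementation. The helper make_generator, the column names image_name and label, and the default target_size are assumptions for illustration; the import path matches standalone Keras 2.x and may differ in your installation.

```python
def normalize(img):
    # Same scaling as prepare_data in the question: [0, 255] -> [-0.5, 0.5]
    return img / 255.0 - 0.5


def make_generator(df, image_dir, batch_size=32, target_size=(256, 256)):
    """Hypothetical helper: build a lazily loading batch generator from df."""
    # Imported here so that normalize() can be used without Keras installed
    from keras.preprocessing.image import ImageDataGenerator

    # Do not also pass rescale here, or the images would be scaled twice
    datagen = ImageDataGenerator(preprocessing_function=normalize)
    return datagen.flow_from_dataframe(
        dataframe=df,
        directory=image_dir,        # folder that contains the image files
        x_col='image_name',         # assumed name of the filename column
        y_col='label',              # assumed name of the string label column
        target_size=target_size,    # every image is resized to this size
        class_mode='categorical',   # yields one-hot encoded labels
        batch_size=batch_size,
        shuffle=True,
    )
```

With class_mode='categorical' the string labels are one-hot encoded by the generator, so the label2int and to_categorical steps from prepare_data should not be needed in this variant.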


Original answer below

In your case, with the dataframe as you describe it, you could also write your own custom generator that reuses the logic in your prepare_data function as a more minimal solution. It's good practice to use Keras' Sequence object for this, since it allows multiprocessing (which helps to avoid bottlenecking your GPU, if you are using one).

You can check out the docs on the Sequence object, which contain an implementation example. Eventually, your code would be something along these lines (this is boilerplate code; you will have to add specifics such as your label2int mapping or the image preprocessing logic):

import math
import random

import numpy as np
from imageio import imread  # assumption: use the same imread as in your prepare_data
from keras.utils import Sequence

class DataSequence(Sequence):
    """
    Keras Sequence object to train a model on larger-than-memory data.
    """
    def __init__(self, df, batch_size, mode='train'):
        self.df = df # your pandas dataframe
        self.bsz = batch_size # batch size
        self.mode = mode # shuffle when in train mode

        # Keep the labels and the list of image locations in memory
        self.labels = self.df['label'].values
        self.im_list = self.df['image_name'].tolist()

        # Build the index order used during the first epoch
        self.on_epoch_end()

    def __len__(self):
        # compute number of batches to yield
        return int(math.ceil(len(self.df) / float(self.bsz)))

    def on_epoch_end(self):
        # Shuffles indexes after each epoch if in training mode
        self.indexes = list(range(len(self.im_list)))
        if self.mode == 'train':
            random.shuffle(self.indexes)

    def get_batch_indexes(self, idx):
        # Sample indexes that belong to batch number idx
        return self.indexes[idx * self.bsz: (idx + 1) * self.bsz]

    def get_batch_labels(self, idx):
        # Fetch a batch of labels (apply label2int / to_categorical here)
        return self.labels[self.get_batch_indexes(idx)]

    def get_batch_features(self, idx):
        # Fetch a batch of inputs (apply your normalization here)
        return np.array([imread(self.im_list[i]) for i in self.get_batch_indexes(idx)])

    def __getitem__(self, idx):
        batch_x = self.get_batch_features(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_x, batch_y
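To see how __len__ and the batch slices fit together, the arithmetic can be checked on its own, without Keras or any images. The sketch below is only an illustration; batch_bounds is a hypothetical helper, not part of Keras. It shows that the last batch is simply shorter when the number of samples is not a multiple of the batch size.

```python
import math


def batch_bounds(n_samples, batch_size):
    """Return the (start, stop) slice bounds of every batch in one epoch."""
    # Same formula as DataSequence.__len__
    n_batches = int(math.ceil(n_samples / float(batch_size)))
    # Same slicing as in get_batch_labels / get_batch_features; the stop
    # value is clipped so the size of the final batch is easy to read off
    return [(idx * batch_size, min((idx + 1) * batch_size, n_samples))
            for idx in range(n_batches)]


if __name__ == "__main__":
    print(batch_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```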

You can pass this object to train your model just like a custom generator:

sequence = DataSequence(dataframe, batch_size)
model.fit_generator(sequence, epochs=1, use_multiprocessing=True)

As noted below, you are not required to implement the shuffling logic yourself. It suffices to set the shuffle argument to True in the fit_generator() call. From the docs:

shuffle: Boolean. Whether to shuffle the order of the batches at the beginning of each epoch. Only used with instances of Sequence (keras.utils.Sequence). Has no effect when steps_per_epoch is not None.
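Note that, as the quoted text says, this option shuffles the order of the batches, not the samples inside them. The following sketch illustrates that idea in plain Python. It is not Keras' actual implementation, and batch_order is a hypothetical helper name.

```python
import random


def batch_order(n_batches, shuffle, seed=None):
    """Order in which batch indexes 0..n_batches-1 are visited in one epoch."""
    rng = random.Random(seed)
    order = list(range(n_batches))
    if shuffle:
        # Only the visiting order changes; each index still maps to the
        # same batch, i.e. the same call to sequence[idx]
        rng.shuffle(order)
    return order


if __name__ == "__main__":
    print(batch_order(5, shuffle=False))  # [0, 1, 2, 3, 4]
    print(sorted(batch_order(5, shuffle=True)))  # [0, 1, 2, 3, 4]
```

If you also want samples to move between batches from one epoch to the next, keep a per-sample shuffle in on_epoch_end.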

Upvotes: 9

Cindy W.

Reputation: 31

I am new to Keras, so take my advice with a grain of salt. I think you should use a Keras ImageDataGenerator, in particular its flow_from_dataframe method, since you said you have a pandas dataframe. flow_from_dataframe reads columns of the dataframe to get your filenames and your labels.

Below is a snippet of an example. Look online for tutorials.

train_datagen = ImageDataGenerator(horizontal_flip=True,
                                   vertical_flip=False,
                                   rescale=1/255.0)

train_generator = train_datagen.flow_from_dataframe(     
    dataframe=trainDataframe,  
    directory=imageDir,
    x_col="file", # name of col in data frame that contains file names
    y_col=y_col_list, # name of col with labels
    has_ext=True, 
    batch_size=batch_size,
    shuffle=True,
    save_to_dir=saveDir,
    target_size=(img_width,img_height),
    color_mode='grayscale',
    class_mode='categorical', # for classification task
    interpolation='bilinear')
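To fit a model with such a generator, something along these lines should work. This is a sketch under assumptions: fit_from_generator is a hypothetical wrapper, model is an already compiled Keras model, and the generator supports len() (the iterators returned by the flow_* methods do, giving the number of batches per epoch). In recent TensorFlow releases fit_generator is deprecated and model.fit accepts the generator directly.

```python
def fit_from_generator(model, generator, epochs=1):
    """Hypothetical wrapper: train model on lazily loaded batches."""
    # One epoch is one full pass over the dataframe
    steps = len(generator)
    return model.fit_generator(generator,
                               steps_per_epoch=steps,
                               epochs=epochs)
```

For example: fit_from_generator(model, train_generator, epochs=10).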

Upvotes: 3
