Bhuvan S

Reputation: 213

How to implement K-fold cross-validation with ImageDataGenerator and flow_from_dataframe (labels in a CSV file)

Please show or explain a dummy code snippet demonstrating K-fold cross-validation with flow_from_dataframe and the resulting training and validation generator objects in Keras. This is my current code (no K-fold, just a simple fit):

ImageDataGenerator object that performs all the augmentations:

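from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split

IMG_SIZE = (150, 150)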
core_idg = ImageDataGenerator(samplewise_center=True, 
                              samplewise_std_normalization=True, 
                              horizontal_flip = True, 
                              vertical_flip = False, 
                              height_shift_range= 0.05, 
                              width_shift_range=0.1, 
                              rotation_range=5, 
                              shear_range = 0.1,
                              fill_mode = 'reflect',
                              zoom_range=0.15)

Split the main dataframe into train_df and valid_df:

# Stratify on the label column of the same dataframe that is being split
train_df, valid_df = train_test_split(df_large,
                                      test_size = 0.10,
                                      random_state = 2018,
                                      stratify = df_large['BINARY'])

Create train_gen and valid_gen using the flow_from_dataframe method of the ImageDataGenerator object created above.

"IMAGE_NAMES" and "BINARY" are the columns that contain the image file names and the label (0 or 1).

all_labels = [ "0" , "1" ]

train_gen = core_idg.flow_from_dataframe(dataframe=train_df,
                                         directory="./DataFolder/",
                                         x_col = 'IMAGE_NAMES',
                                         y_col = 'BINARY',
                                         class_mode = 'categorical',
                                         classes = all_labels,
                                         target_size = IMG_SIZE,
                                         color_mode = 'rgb',
                                         batch_size = 64)

valid_gen = core_idg.flow_from_dataframe(dataframe=valid_df,
                                         directory="./DataFolder/",
                                         x_col = 'IMAGE_NAMES',
                                         y_col = 'BINARY',
                                         class_mode = 'categorical',
                                         classes = all_labels,
                                         target_size = IMG_SIZE,
                                         color_mode = 'rgb',
                                         batch_size = 256)

test_X, test_Y = next(core_idg.flow_from_dataframe(dataframe=valid_df,
                                         directory="./DataFolder/",
                                         x_col = 'IMAGE_NAMES',
                                         y_col = 'BIN_STR',
                                         class_mode = 'categorical',
                                         classes = all_labels,
                                         target_size = IMG_SIZE,
                                         color_mode = 'rgb',
                                         batch_size = 256))

# Fitting
hist = model.fit_generator(train_gen,
                           validation_data = (test_X, test_Y),
                           epochs = 30,
                           callbacks = call_list)

Now, how do I translate this to K-fold cross-validation? As I understand it, core_idg should be created once outside the K-fold loop, and instead of a fixed train_df and valid_df, the indices produced by the K-fold split should be used to build them for each fold. How can the code snippet above be transformed?

Upvotes: 4

Views: 4707

Answers (1)

Bhuvan S

Reputation: 213

Something like this worked for me: create the train and validation dataframes inside the K-fold loop.

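import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import KFold

IMG_SIZE = (150, 150)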
core_idg = ImageDataGenerator(samplewise_center=True, 
                              samplewise_std_normalization=True, 
                              horizontal_flip = True, 
                              vertical_flip = False, 
                              height_shift_range= 0.05, 
                              width_shift_range=0.1, 
                              rotation_range=5, 
                              shear_range = 0.1,
                              fill_mode = 'reflect',
                              zoom_range=0.15)

# Training with K-fold cross-validation.
# k_folds, df_large and build_model() are assumed to be defined elsewhere.
kf = KFold(n_splits=k_folds, random_state=None, shuffle=True)
X = np.array(df_large["IMAGE_NAMES"])
for train_index, test_index in kf.split(X):
    trainData = X[train_index]
    testData = X[test_index]
    # Build the train/validation dataframes, and from them train_gen and valid_gen, for this fold
    train_df = df_large.loc[df_large["IMAGE_NAMES"].isin(list(trainData))]
    valid_df = df_large.loc[df_large["IMAGE_NAMES"].isin(list(testData))]
    # Create a fresh model for each fold so that weights do not carry over between folds
    model = build_model()
    all_labels = [ "0" , "1" ]
    train_gen = core_idg.flow_from_dataframe(dataframe=train_df,
                                         directory="./DataFolder/",
                                         x_col = 'IMAGE_NAMES',
                                         y_col = 'BINARY',
                                         class_mode = 'categorical',
                                         classes = all_labels,
                                         target_size = IMG_SIZE,
                                         color_mode = 'rgb',
                                         batch_size = 64)
    valid_gen = core_idg.flow_from_dataframe(dataframe=valid_df,
                                         directory="./DataFolder/",
                                         x_col = 'IMAGE_NAMES',
                                         y_col = 'BINARY',
                                         class_mode = 'categorical',
                                         classes = all_labels,
                                         target_size = IMG_SIZE,
                                         color_mode = 'rgb',
                                         batch_size = 256)
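    # steps_per_epoch counts batches, not samples; len(train_gen) is the
    # number of batches in one pass over train_df.
    hist = model.fit_generator(
            train_gen,
            steps_per_epoch = len(train_gen),
            epochs = n_epochs,
            validation_data = valid_gen,
            validation_steps = len(valid_gen),
            callbacks = callback_list)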
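
Since the original single split was stratified on the label, one possible variant of the fold-splitting step is sketched below. It swaps KFold for scikit-learn's StratifiedKFold and selects rows by position with iloc instead of matching names with isin, which also behaves predictably if a file name appears twice. The toy dataframe, its file names, and the values of k_folds and random_state are assumptions made only for illustration; the generators and model are left out so the split can be checked on its own.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for df_large (hypothetical file names): 14 rows of class "0", 6 of class "1".
df_large = pd.DataFrame({
    "IMAGE_NAMES": [f"img_{n:03d}.png" for n in range(20)],
    "BINARY": ["0"] * 14 + ["1"] * 6,
})

k_folds = 3  # assumed value for the sketch
skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=2018)

folds = []
for train_index, valid_index in skf.split(df_large["IMAGE_NAMES"], df_large["BINARY"]):
    # split() yields positional indices, so iloc picks the rows directly.
    train_df = df_large.iloc[train_index]
    valid_df = df_large.iloc[valid_index]
    folds.append((train_df, valid_df))
    # In the real loop, train_df and valid_df would be passed to
    # core_idg.flow_from_dataframe(...) at this point.
```

Each validation fold then keeps roughly the same class ratio as the full dataframe, which matters when one class is much rarer than the other.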

If you have any suggestions to make this better, please comment.
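
One addition that may be worth considering: the loop keeps only the last fold's hist, whereas a cross-validation result is usually reported as the mean and spread of one score per fold. A minimal sketch of that aggregation follows. The helper name summarize_folds and the example scores are hypothetical; in the real loop each score would come from hist.history, whose key for validation accuracy depends on the Keras version and the metrics the model was compiled with.

```python
from statistics import mean, stdev

def summarize_folds(fold_scores):
    """Return (mean, sample standard deviation) of per-fold validation scores."""
    if len(fold_scores) < 2:
        raise ValueError("need scores from at least two folds")
    return mean(fold_scores), stdev(fold_scores)

# Hypothetical best validation accuracy from each of three folds (illustrative numbers only).
fold_scores = [0.80, 0.90, 0.85]
avg, spread = summarize_folds(fold_scores)
print(f"validation accuracy: {avg:.3f} +/- {spread:.3f}")
```

Inside the K-fold loop this would amount to appending one number per fold to fold_scores after fit_generator returns, then calling summarize_folds once after the loop.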

Upvotes: 3
