I was practicing Keras classification on imbalanced data. I followed the official example:
https://keras.io/examples/structured_data/imbalanced_classification/
and used the scikit-learn API to do cross-validation. I have tried the model with different parameters. However, every time, one of the 3 folds has a recall of 0.
e.g.
results [0.99242424 0.99236641 0. ]
What am I doing wrong? How can I get ALL THREE validation recall values to be on the order of 0.8?
%%time
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
import os
import random
SEED = 100
os.environ['PYTHONHASHSEED'] = str(SEED)
np.random.seed(SEED)
random.seed(SEED)
tf.random.set_seed(SEED)
# load the data
ifile = "https://github.com/bhishanpdl/Datasets/blob/master/Projects/Fraud_detection/raw/creditcard.csv.zip?raw=true"
df = pd.read_csv(ifile,compression='zip')
# train test split
target = 'Class'
Xtrain, Xtest, ytrain, ytest = train_test_split(
    df.drop([target], axis=1), df[target],
    test_size=0.2, stratify=df[target], random_state=SEED)
print(f"Xtrain shape: {Xtrain.shape}")
print(f"ytrain shape: {ytrain.shape}")
# build the model
def build_fn(n_feats):
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(256, activation="relu", input_shape=(n_feats,)))
    model.add(keras.layers.Dense(256, activation="relu"))
    model.add(keras.layers.Dropout(0.3))
    model.add(keras.layers.Dense(256, activation="relu"))
    model.add(keras.layers.Dropout(0.3))
    # last layer is dense 1 for binary sigmoid
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    # compile
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(1e-2),
                  metrics=['Recall'])
    return model
# fitting the model
n_feats = Xtrain.shape[-1]
counts = np.bincount(ytrain)
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]
class_weight = {0: weight_for_0, 1: weight_for_1}
FIT_PARAMS = {'class_weight' : class_weight}
clf_keras = KerasClassifier(build_fn=build_fn,
                            n_feats=n_feats,  # custom argument
                            epochs=30,
                            batch_size=2048,
                            verbose=2)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
results = cross_val_score(clf_keras, Xtrain, ytrain,
                          cv=skf,
                          scoring='recall',
                          fit_params=FIT_PARAMS,
                          n_jobs=-1,
                          error_score='raise')
print('results', results)
Xtrain shape: (227845, 30)
ytrain shape: (227845,)
results [0.99242424 0.99236641 0. ]
CPU times: user 3.62 s, sys: 117 ms, total: 3.74 s
Wall time: 5min 15s
The third recall value is 0. I expect it to be on the order of 0.8. How can I make sure all three values are around 0.8 or more?
Upvotes: 3
Views: 2354
MilkyWay001,
You have chosen to use the sklearn wrappers for your model. They have benefits, but they hide the model training process. Instead, I trained the model separately, with a validation dataset added. The code for this would be:
clf_1 = KerasClassifier(build_fn=build_fn,
                        n_feats=n_feats)
clf_1.fit(Xtrain, ytrain, class_weight=class_weight,
          validation_data=(Xtest, ytest),
          epochs=30, batch_size=2048,
          verbose=1)
The Model.fit() output clearly shows that while the loss goes down, recall is not stable. This led to poor performance in CV, reflected in the zeros in the CV results that you observed.
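If you would rather quantify that instability than read it off the training log, here is a minimal sketch. It assumes the fit() call returns a Keras History object whose .history attribute is a dict of per-epoch lists; the helper name summarize_metric and the key names "recall" / "val_recall" are my own assumptions (Keras may name the keys differently, for example with a numeric suffix), and the numbers in the example are purely illustrative, not from a real run.

```python
def summarize_metric(history, key="recall"):
    """Summarise how much a per-epoch metric moves during training.

    `history` is a dict of per-epoch lists, i.e. the same shape as the
    `History.history` dict produced by Keras `Model.fit()`.
    """
    values = history[key]
    # Epoch-to-epoch drops; a large drop means the metric collapsed
    # between two consecutive epochs.
    drops = [prev - cur for prev, cur in zip(values, values[1:])]
    return {
        "final": values[-1],
        "best": max(values),
        "worst": min(values),
        "largest_drop": max([0.0] + drops),
    }


# Illustrative values only (hypothetical, not measured).
example = {"val_recall": [0.90, 0.95, 0.10, 0.92, 0.00]}
print(summarize_metric(example, key="val_recall"))
```

A final value far below the best value, or a large single-epoch drop, suggests that the score you get depends mostly on which epoch training happened to stop at, which is what a 0 in one CV fold looks like from the outside.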
I fixed this by reducing the learning rate to just 0.0001. Although that is 100 times smaller than yours, the model reaches 98% recall on train and 100% (or close) on test in just 10 epochs.
Your code needs just one fix to achieve stable results: change the LR to a much lower one, like 0.0001:
optimizer=keras.optimizers.Adam(1e-4),
You can experiment with LR values below 0.001.
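One way to run that experiment is sketched below. It is not your exact setup: it uses plain tf.keras with a manual StratifiedKFold loop instead of the sklearn wrapper, a deliberately small network, and two hypothetical helpers of my own, build_model and cv_recall. The 0.5 decision threshold and the default of 10 epochs are also assumptions.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras


def build_model(n_feats, learning_rate):
    """Small binary classifier; learning_rate is the only knob being varied."""
    model = keras.Sequential([
        keras.Input(shape=(n_feats,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer=keras.optimizers.Adam(learning_rate=learning_rate))
    return model


def cv_recall(X, y, learning_rate, class_weight,
              n_splits=3, epochs=10, seed=100):
    """Return one validation recall per fold for a given learning rate.

    X and y must be NumPy arrays (not DataFrames), because they are
    indexed with the integer arrays produced by StratifiedKFold.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        keras.utils.set_random_seed(seed)  # same initial weights in every fold
        model = build_model(X.shape[1], learning_rate)
        model.fit(X[train_idx], y[train_idx],
                  epochs=epochs, batch_size=2048,
                  class_weight=class_weight, verbose=0)
        proba = model.predict(X[val_idx], verbose=0).ravel()
        # Assumed decision threshold of 0.5 on the sigmoid output.
        scores.append(recall_score(y[val_idx], (proba > 0.5).astype(int)))
    return np.array(scores)
```

With the variables from the question, a sweep could then look like `for lr in (1e-2, 1e-3, 1e-4): print(lr, cv_recall(Xtrain.to_numpy(), ytrain.to_numpy(), lr, class_weight))`. The spread between folds matters as much as the mean here: a learning rate that gives three similar recalls is the stable one.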
For reference, with LR 0.0001 I got:
results [0.99242424 0.97709924 1. ]
Good luck!
PS: thanks for including a compact and complete MWE.
Upvotes: 2