Issues using GridSearchCV with RandomForestClassifier using large data, always showing recall score = 1, so best params becomes redundant

Question

This is my first StackOverflow question and I am in need of help! I have exhaustively searched for answers myself and through experimentation but I am hoping someone from the community can help.

This is work for my dissertation at Uni, so any help would be extremely appreciated.

I will try to summarise as best as possible:

I am working with Scikit-learn classifiers and attempting to tune/CV them with GridSearchCV, to form baselines for future work with Keras/Tensorflow.
My problem lies currently with RandomForestClassifier / GridSearchCV.
I am using large amounts of data. Credit Card Fraud Data from Kaggle here.
The data is imbalanced, so I use SMOTE to oversample so the training split is equal for both 0 class and 1 class (fraud). This is about 200,000 each.

Now to explain the problem:

When I run this GridSearchCV for this data on RandomForestClassifer, the recall score is always = 1. This means that no particular parameters are chosen as 'best'. Also I don't understand why this is always 1. This takes about 6-8 hours to run and so this becomes pointless if every iteration has recall=1.
However, when I simply run a single fit on the data (no GridsearchCV) and do a predict-test. I get around 80-84% score result (again interested in Recall). Which is certainly more realistic.

My thoughts / experiments:

I tried under sampling the data, to 492 of each class, which gives around 90% per GSCV iteration. Seems better but still apparently above average.
Also tried varying training set sizes (50,000, 100,000, ...) and they all give recall=1 for each iteration, too.

My guess was that there is too much data / overfitting / something as to why this was happening. Alternatively I thought that Gridsearch was taking the overall / non-fraud classification metrics which is near 1 in these cases.

Here is a pic of output of running the GSCV on the {0: 200,000, 1: 200,000} training set: GSCV each iteration recall=1 Which as you can see, has score =1 for each fold yet when doing a test/predict with the model after, we get a seemingly valid 80% ish metric in the classification report.

I know the testing set is quite a small number of fraud cases, (just a couple hundred). But this was because I only oversampled training data, to keep fresh (unseen) test data.

So by looking at classification reports, I thought that GridSearchCV could be taking the wrong values (i.e. we are interested in the class=1 metrics). However looking at docs, Pos_label = 1 is a default in the scorers in skikit-learn. So this shouldn't be the issue.

I have tried custom scorers / default scorers etc.

Here is my code (a bit messy but it should be clear what is going on! Note the commented out single RF classifier, without GridSearch):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import itertools

data = pd.read_csv("creditcard.csv")

# Normalise and reshape the Amount column, so it's values lie between -1 and 1
from sklearn.preprocessing import StandardScaler
data['norm_Amount'] = StandardScaler().fit_transform(data['Amount'].reshape(-1,1))

# Drop the old Amount column and also the Time column as we don't want to include this at this stage
data = data.drop(['Time', 'Amount'], axis=1)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

########################################################
# MODEL SETUP

# Assign variables x and y corresponding to row data and it's class value
X = data.ix[:, data.columns != 'Class']
y = data.ix[:, data.columns == 'Class']

# Whole dataset, training-test data splitting
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

from collections import Counter
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=1)
X_res, y_res = sm.fit_sample(X_train, y_train)
print('Original dataset shape {}'.format(Counter(data['Class'])))
print('Training dataset shape {}'.format(Counter(y_train['Class'])))
print('Resampled training dataset shape {}'.format(Counter(y_res)))



print 'Random Forest: '
from sklearn.ensemble import RandomForestClassifier

# rf = RandomForestClassifier(n_estimators=250, criterion="gini", max_features=3, max_depth=10)

rf = RandomForestClassifier()
param_grid = { "n_estimators"      : [250, 500, 750],
           "criterion"         : ["gini", "entropy"],
           "max_features"      : [3, 5]}

from sklearn.metrics import recall_score, make_scorer
scorer = make_scorer(recall_score, pos_label=1)


grid_search = GridSearchCV(rf, param_grid, n_jobs=1, cv=3, scoring=scorer, verbose=50)
grid_search.fit(X_res, y_res)
print grid_search.best_params_, grid_search.best_estimator_

# rf.fit(X_res, y_res)
# y_pred = rf.predict(X_test)
y_pred = grid_search.predict(X_test)
from sklearn.metrics import classification_report
print classification_report(y_test, y_pred)
print 'Test recall score: ', recall_score(y_test, y_pred)

Thanks,

Harry

Issues using GridSearchCV with RandomForestClassifier using large data, always showing recall score = 1, so best params becomes redundant

Answers (1)

Related Questions