Reputation: 11
This is my first Stack Overflow question and I am in need of help! I have searched exhaustively for answers myself and through experimentation, but I am hoping someone from the community can help.
This is work for my dissertation at Uni, so any help would be extremely appreciated.
I will try to summarise as best I can:
Now to explain the problem:
My thoughts / experiments:
My guess was that too much data, overfitting, or something similar was the reason this was happening. Alternatively, I thought that GridSearchCV was taking the overall / non-fraud classification metric, which is near 1 in these cases.
Here is a picture of the output from running the GSCV on the {0: 200,000, 1: 200,000} training set: [screenshot: GSCV reports recall = 1 for each iteration]. As you can see, the score is 1 for every fold, yet when doing a test/predict with the model afterwards, we get a seemingly valid 80%-ish metric in the classification report.
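Rather than relying on the verbose output alone, the per-fold and mean cross-validated recall can also be read back from the fitted search object afterwards. A minimal sketch, assuming the grid_search object from the code further down (cv_results_ and its split*_test_score / mean_test_score keys are standard GridSearchCV attributes):
import pandas as pd
# Per-fold and mean cross-validation recall for every parameter combination
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['params', 'split0_test_score', 'split1_test_score',
                  'split2_test_score', 'mean_test_score']])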
I know the test set contains only a small number of fraud cases (just a couple of hundred), but this is because I only oversampled the training data, in order to keep the test data fresh (unseen).
So, by looking at the classification reports, I thought that GridSearchCV could be taking the wrong values (i.e. we are interested in the class = 1 metrics). However, looking at the docs, pos_label=1 is the default for the scorers in scikit-learn, so this shouldn't be the issue.
I have tried custom scorers / default scorers etc.
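As a quick sanity check on the pos_label point, here is a toy sketch (labels invented purely for illustration) showing that recall_score, and therefore make_scorer(recall_score), reports the recall of class 1 by default:
from sklearn.metrics import recall_score
# Toy labels: four genuine (0) and two fraud (1) cases
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(recall_score(y_true, y_pred))                # class-1 recall: 1 of 2 frauds caught -> 0.5
print(recall_score(y_true, y_pred, pos_label=0))   # class-0 recall: 3 of 4 genuine kept -> 0.75
print(recall_score(y_true, y_pred, average=None))  # per-class recall: [0.75, 0.5]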
Here is my code (a bit messy, but it should be clear what is going on; note the commented-out single RF classifier without GridSearch):
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import itertools
data = pd.read_csv("creditcard.csv")
# Standardise the Amount column (zero mean, unit variance), reshaping it to a 2-D array first
from sklearn.preprocessing import StandardScaler
data['norm_Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the old Amount column and also the Time column as we don't want to include this at this stage
data = data.drop(['Time', 'Amount'], axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report
########################################################
# MODEL SETUP
# Assign X and y to the feature columns and the class label respectively
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
# Whole dataset, training-test data splitting
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
from collections import Counter
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=1)
X_res, y_res = sm.fit_resample(X_train, y_train.values.ravel())  # fit_sample in older imblearn versions
print('Original dataset shape {}'.format(Counter(data['Class'])))
print('Training dataset shape {}'.format(Counter(y_train['Class'])))
print('Resampled training dataset shape {}'.format(Counter(y_res)))
print('Random Forest: ')
from sklearn.ensemble import RandomForestClassifier
# rf = RandomForestClassifier(n_estimators=250, criterion="gini", max_features=3, max_depth=10)
rf = RandomForestClassifier()
param_grid = { "n_estimators" : [250, 500, 750],
"criterion" : ["gini", "entropy"],
"max_features" : [3, 5]}
from sklearn.metrics import recall_score, make_scorer
scorer = make_scorer(recall_score, pos_label=1)
grid_search = GridSearchCV(rf, param_grid, n_jobs=1, cv=3, scoring=scorer, verbose=50)
grid_search.fit(X_res, y_res)
print(grid_search.best_params_, grid_search.best_estimator_)
# rf.fit(X_res, y_res)
# y_pred = rf.predict(X_test)
y_pred = grid_search.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
print('Test recall score: ', recall_score(y_test, y_pred))
Thanks,
Harry
Upvotes: 1
Views: 1810
Reputation: 11
This is a problem of overfitting. When you combine cross-validation with oversampling, it is important that the oversampling is applied only to the training data and not to the validation data: for a 10-fold cross-validation, the nine training folds are oversampled and used as the training set, while the remaining fold is used as the validation set without any oversampling.
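Here is a minimal sketch of one common way to arrange this, using imblearn's Pipeline so that SMOTE is re-fitted on the training folds only inside each cross-validation split. It reuses X_train, y_train, X_test and y_test from the code in the question; the step names and grid values are illustrative, not the asker's exact settings:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score, make_scorer, classification_report
# SMOTE is a pipeline step, so GridSearchCV applies it only to the
# training folds; each validation fold is scored without oversampling.
pipeline = Pipeline([('smote', SMOTE(random_state=1)),
                     ('rf', RandomForestClassifier())])
param_grid = {'rf__n_estimators': [250, 500, 750],
              'rf__criterion': ['gini', 'entropy'],
              'rf__max_features': [3, 5]}
scorer = make_scorer(recall_score, pos_label=1)
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring=scorer, n_jobs=1, verbose=1)
# Fit on the original (imbalanced) training split, not on the oversampled X_res / y_res
grid_search.fit(X_train, y_train.values.ravel())
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print('Test recall score: ', recall_score(y_test, y_pred))
This way the validation recall reported by GridSearchCV is measured on untouched, imbalanced data, so it becomes comparable to the recall you see on the held-out test set.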
Upvotes: 1