Reputation: 379
I'm trying to use imblearn to plot a ROC curve, but I've run into a problem.
Here's a screenshot of my data:
from imblearn.over_sampling import SMOTE, ADASYN
from collections import Counter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
# Import some data to play with
df = pd.read_csv("E:\\autodesk\\Hourly and weather ml.csv")
# X and y are different columns of the input data. Input X as numpy array
X = df[['TTI','Max TemperatureF','Mean TemperatureF','Min TemperatureF',' Min Humidity']].values
# # Reshape X. Do this if X has only one value per data point. In this case, TTI.
# # Input y as normal list
y = df['TTI_Category'].as_matrix()
X_resampled, y_resampled = SMOTE().fit_sample(X, y)
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
n_classes = y_resampled.shape[1]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score=classifier.fit(X_resampled, y_resampled).predict_proba(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
I changed the original X_train and y_train to X_resampled, y_resampled, since the training should be done on the resampled dataset and the testing needs to be done on the original test dataset. However, I got the following traceback:
runfile('E:/autodesk/SMOTE with multiclass.py', wdir='E:/autodesk')
Traceback (most recent call last):
  File "<ipython-input-128-efb16ffc92ca>", line 1, in <module>
    runfile('E:/autodesk/SMOTE with multiclass.py', wdir='E:/autodesk')
  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "E:/autodesk/SMOTE with multiclass.py", line 51, in <module>
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
IndexError: too many indices for array
I have added another line to binarize both y_resampled and the original y, and everything else stays the same, but I'm not sure whether I'm fitting on the resampled data and testing on the original data:
X_resampled, y_resampled = SMOTE().fit_sample(X, y)
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
y = label_binarize(y, classes=['Good','Bad','Ok'])
n_classes = y.shape[1]
Thanks a lot for your help.
Upvotes: 2
Views: 414
Reputation: 36619
First, let's discuss the error. You are doing this:
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
n_classes = y_resampled.shape[1]
So your n_classes is actually 3.
In the subsequent part, you did this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)
Here you used the original y, not y_resampled. So y_test is currently a 1-D array of shape (n_samples,), or maybe a column vector of shape (n_samples, 1).
In the for loop, i runs from 0 to n_classes - 1 (that is, 0, 1, 2) and y_test[:, i] tries to select column i. That is not possible for y_test: a 1-D array accepts only one index, hence the IndexError "too many indices for array".
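To make the shape problem concrete, here is a minimal sketch. The label values are made up and only stand in for the TTI_Category column; the class list is the one from the question. It shows that two-index slicing fails on the raw 1-D labels and works once they are binarized:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Toy labels standing in for the TTI_Category column (values are made up).
y_test = np.array(['Good', 'Bad', 'Ok', 'Good'])
print(y_test.shape)  # (4,)

# Two indices on a 1-D array raise IndexError, as in the traceback.
try:
    y_test[:, 0]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

# After binarizing there is one 0/1 column per class, so [:, i] is valid.
y_test_bin = label_binarize(y_test, classes=['Good', 'Bad', 'Ok'])
print(y_test_bin.shape)  # (4, 3)
first_column = y_test_bin[:, 0]  # indicator column for 'Good'
```

Each column of y_test_bin is the one-vs-rest target that roc_curve expects for that class.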
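Secondly, you should first split the data into train and test sets and then resample the training part only. In your version the classifier is fitted on a resampled copy of the whole X, which already contains the rows that later end up in X_test, so the test scores would be too optimistic.
So this code should do what you want: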
# First divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)
# Then resample only the training data
# (newer imbalanced-learn releases name this method fit_resample)
X_resampled, y_resampled = SMOTE().fit_sample(X_train, y_train)
# Then label binarize them to be used in the multi-class ROC
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
# Do this to the test data too
y_test = label_binarize(y_test, classes=['Good','Bad','Ok'])
n_classes = y_test.shape[1]
y_score = classifier.fit(X_resampled, y_resampled).predict_proba(X_test)
# Then you can do this and the other parts of the code
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
Upvotes: 1