Reputation: 11
I need to produce a confusion matrix using the crosstab function in Python (as an exercise). I have been doing this with various data sets and it worked fine but this time I'm having an odd problem.
The data set is divided into training and test sets (X_train, y_train, X_test, y_test). y_test is a Series of 0s and 1s holding the response variable for the test set. I fit a logistic regression on the training set and predicted the response for the test set:
logit1 = sm.Logit(y_train, X_train).fit()
pred = logit1.predict(X_test)
Then I use a cutoff of 0.5 to classify each predicted value, which gives me a Series of 0s and 1s of the same length as y_test (2500). This Series is called res, and now I want to create the confusion table with crosstab:
cross_table = pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)
But this gives me the following table which doesn't add up to 2500:
Predicted  0.0  1.0  All
Actual
0.0        413   52  465
1.0        140   20  160
All        553   72  625
While when I use the confusion_matrix function from sklearn, I get the correct total of 2500:
confusion_matrix(y_test, res)
array([[1817,  110],
       [ 369,  205]])
What is the problem here with my crosstab?
Packages:
from pandas import Series, DataFrame
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix
Full code:
# indexes of train and test were provided in external files:
train = pd.read_csv('/Users//train.csv')
test = pd.read_csv('/Users//test.csv')
X_train = X.iloc[train.values[:,0],:]
X_test = X.iloc[test.values[:,0],:]
y_train = y[train.values[:,0]]
y_test = y[test.values[:,0]]
logit1 = sm.Logit(y_train, X_train).fit()
pred = logit1.predict(X_test)
res = []
for i in pred:
    if i >= 0.5:
        each = 1
    else:
        each = 0
    res.append(each)
res = Series(res)
cross_table = pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)
d = confusion_matrix(y_test, res)
Suggested edit:
cross_table = pd.crosstab(y_test, res, rownames=['Actual'],
                          colnames=['Predicted'], margins=True, dropna=False)
Predicted   0.0  1.0   All
Actual
0.0         413   52  1927
1.0         140   20   574
All        2186  315  4377
Upvotes: 0
Views: 5785
Reputation: 11
While I still don't know why the above didn't work, I figured out what needs to be changed to make it work. The object res, containing the predictions, needs to be saved as an array:
import numpy as np
res = np.array(res)
cross_table = pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)
Predicted     0    1   All
Actual
0          1817  110  1927
1           369  205   574
All        2186  315  2501
This matches the result from confusion_matrix.
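A likely explanation for why the array works: when pd.crosstab receives two Series, it aligns them on their index labels before counting. y_test keeps the row labels it had in y, while a res built from a plain list gets a fresh 0..n-1 index, so only rows whose labels appear in both get counted. A NumPy array has no index, so the values are paired by position instead. An alternative that keeps everything in pandas is to let res carry the same labels as y_test. Below is a minimal sketch of that; the data is made up, and it assumes pred is a Series indexed like y_test, which I have not verified for the statsmodels output in the question.

```python
import pandas as pd

# Hypothetical stand-ins for the question's objects: y_test carries
# non-default row labels, and pred is ASSUMED to share those labels.
y_test = pd.Series([0, 1, 1, 0, 1, 0], index=[10, 42, 7, 99, 3, 58])
pred = pd.Series([0.2, 0.7, 0.4, 0.1, 0.9, 0.6], index=y_test.index)

# Vectorized threshold at 0.5; the result keeps pred's index, so no
# fresh 0..n-1 labels are introduced.
res = (pred >= 0.5).astype(int)

cross_table = pd.crosstab(y_test, res, rownames=['Actual'],
                          colnames=['Predicted'], margins=True)
print(cross_table)
```

If pred turns out to be a plain array, wrapping the thresholded values as pd.Series(values, index=y_test.index) should give the same alignment.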
Upvotes: 1
Reputation: 6270
If I do:
import numpy as np
import pandas as pd
data = np.array([1, 1, 0, 0, 0])
data2 = np.array([1, 0, 0, 0, 1])
y_test = pd.Series(data)
res = pd.Series(data2)
and run: pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)
I get the correct table.
And also:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, res)
Gives me the correct output, so the error is somewhere else.
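One difference between this toy example and the question may matter: here both Series share the default 0..4 index, whereas the question's y_test was sliced out of a larger y and keeps those original labels. The sketch below reuses the same values but gives y_test non-default labels (the labels 3..7 are an arbitrary choice of mine) to show how the crosstab total can shrink when the two indexes only partly overlap.

```python
import pandas as pd

# Same toy values as above, but y_test keeps labels 3..7, as if it had
# been sliced out of a larger Series (labels chosen for illustration).
y_test = pd.Series([1, 1, 0, 0, 0], index=[3, 4, 5, 6, 7])
res = pd.Series([1, 0, 0, 0, 1])  # fresh default index 0..4

# Series inputs are aligned on index labels: only labels 3 and 4 exist
# in both, so only those two rows are counted.
misaligned = pd.crosstab(y_test, res, rownames=['Actual'],
                         colnames=['Predicted'], margins=True)

# Plain arrays carry no index, so values are paired by position.
positional = pd.crosstab(y_test.to_numpy(), res.to_numpy(),
                         rownames=['Actual'], colnames=['Predicted'],
                         margins=True)

print(misaligned)
print(positional)
```

The first table counts only the overlapping labels, while the second counts all five rows, which is the same kind of gap as between the crosstab and confusion_matrix totals in the question.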
Upvotes: 0