Reputation: 119
I am working on building a binary classifier in scikit-learn that will classify text reviews. The basic workflow includes the following:
# Imports used below
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.20, random_state=42)
# Instantiate a model
nb = MultinomialNB()
# Train the model
nb.fit(X_train, y_train)
# Make predictions using the trained model
y_pred_class = nb.predict(X_test)
# View the confusion matrix
confusion_matrix(y_test, y_pred_class)
# Output of the confusion matrix
array([[295, 13],
[ 80, 70]])
Based on the confusion matrix, there are 13 false positives and 80 false negatives.
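For reference, scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]], so the individual counts can be unpacked directly; a small sketch reusing the y_test and y_pred_class from above:
from sklearn.metrics import confusion_matrix
# Binary layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
print(fp, fn)  # 13 false positives, 80 false negatives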
I want to see the 13 text reviews that are being classified as false positives.
I followed this post to see if I could get a list of the 13 entries that are classified as false positives.
However, when I run the following:
X_test[y_test != y_pred_class]
I get the following object:
<458x758 sparse matrix of type '<class 'numpy.float64'>'
with 16890 stored elements in Compressed Sparse Row format>
This appears to return all of the rows in X_test (458 total entries); I expected an object with fewer than 458 entries. I also expected to see the text data of X_test rather than a sparse matrix object.
My question is this: how can I return the 13 entries from X_test that were misclassified as false positives? I am looking for an output that looks like the example below.
2175 This has to be the worst restaurant in terms o...
1781 If you like the stuck up Scottsdale vibe this ...
2674 I'm sorry to be what seems to be the lone one ...
Name: text, dtype: object
Upvotes: 2
Views: 2741
Reputation: 36619
For false positives, you also need to check which values in y_pred_class are 1, in addition to y_test != y_pred_class.
Try this:
import numpy as np
# A false positive is a row where the prediction disagrees with the truth
# and the predicted class is 1
false_positives = np.logical_and(y_test != y_pred_class, y_pred_class == 1)
X_test[false_positives]
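Note that X_test here is the vectorized document-term matrix, so indexing it still returns a sparse matrix rather than the review text. Assuming the raw reviews are still available as a pandas Series (called reviews below, an assumption since only the vectorized X is shown in the question), one way to get at the text is to split it with the same test_size and random_state so its rows stay aligned with X_test, then apply the same mask; a rough sketch:
import numpy as np
from sklearn.model_selection import train_test_split
# Hypothetical: 'reviews' is the raw text Series that was vectorized into X.
# Re-splitting it with the same test_size and random_state keeps its rows
# aligned with X_test.
reviews_train, reviews_test = train_test_split(reviews, test_size=0.20,
                                               random_state=42)
# False positive = prediction is 1 but the true label is not
false_positives = np.logical_and(np.asarray(y_test) != y_pred_class,
                                 y_pred_class == 1)
# The 13 review texts predicted positive but labelled negative
reviews_test[false_positives]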
Upvotes: 4