Reputation: 15
I'm trying to list all the wrong predictions in a test set, but I'm not sure how to do it. I searched Stack Overflow, but I may have been searching for the wrong "problem". I have text files in a folder, each containing an email. The problem is that my predictions aren't doing too well, and I want to inspect the emails that are predicted wrong. Currently a snippet of my code looks something like this:
import os
no_head_train_path_0 = 'folder_name'
no_head_train_path_1 = 'folder_name'
def get_data(path):
    text_list = list()
    files = os.listdir(path)
    for text_file in files:
        file_path = os.path.join(path, text_file)
        read_file = open(file_path,'r+')
        read_text = read_file.read()
        read_file.close()
        cleaned_text = clean_text(read_text)
        text_list.append(cleaned_text)
    return text_list, files
no_head_train_0, temp = get_data(no_head_train_path_0)
no_head_train_1, temp1 = get_data(no_head_train_path_1)
no_head_train = no_head_train_0 + no_head_train_1
no_head_labels_train = ([0] * len(no_head_train_0)) + ([1] * len(no_head_train_1))
def vocabularymat(TEXTFILES,VOC,PLAY,METHOD):
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    if (METHOD == "TDM"):
        voc = CountVectorizer()
    voc.fit(VOC)
    if (PLAY == "TRAIN"):
        TrainMat = voc.transform(TEXTFILES)
        return TrainMat
    if (PLAY =="TEST"):
        TestMat = voc.transform(TEXTFILES)
        return TestMat
TrainMat = vocabularymat(no_head_train,no_head_train,PLAY= "TRAIN",METHOD="TDM")
X_train = Featurelearning(Traindata, Method="NMF")
y_train = datalabel
X_train, X_test, y_train, y_test = train_test_split(data, datalabel, test_size=0.33,
                                                    random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
proba = model.predict_proba(X_test)
accuracy = accuracy_score(expected, predicted)
recall = recall_score(expected, predicted, average="binary")
precision = precision_score(expected, predicted , average="binary")
f1 = f1_score(expected, predicted , average="binary")
Is it possible to find the emails/file names that are predicted wrong, so I can inspect them manually? (Sorry for the long code.)
Upvotes: 1
Views: 1129
Reputation: 887
# find the wrong predictions
prediction = model.predict(X_test)
# save the positions of the wrongly predicted values
wrong_predict = []
for order, value in enumerate(y_test):
    # model.predict already returns class labels, so compare them directly
    if value != prediction[order]:
        wrong_predict.append(order)
print(wrong_predict)
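To go from those positions back to the emails, the file names have to be in the same order as `y_test`. A minimal sketch with mock data, assuming a list `files_test` (a hypothetical name, not in the question's code) that holds the test-set file names in that order:

```python
# mock data; files_test is a hypothetical list aligned with y_test
files_test = ['mail1.txt', 'mail2.txt', 'mail3.txt', 'mail4.txt']
y_test = [0, 0, 1, 1]
prediction = [0, 1, 0, 1]

# positions in the test set where the label and the prediction differ
wrong_predict = [i for i, (y, p) in enumerate(zip(y_test, prediction)) if y != p]

# look up the file name stored at each wrong position
wrong_files = [files_test[i] for i in wrong_predict]
print(wrong_predict)
print(wrong_files)
```

Note that the positions refer to rows of the test set, not to the original folder listing, so they only map to the right files if the names were split the same way as the labels.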
Upvotes: 0
Reputation: 10545
You can use NumPy to create a Boolean vector indicating which predictions are wrong, and then use that vector to index your array of file names. For example:
import numpy as np
# mock data
files = np.array(['mail1.txt', 'mail2.txt', 'mail3.txt', 'mail4.txt'])
y_test = np.array([0, 0, 1, 1])
predicted = np.array([0, 1, 0, 1])
# create a Boolean index for the wrong classifications
classification_is_wrong = y_test != predicted
# print the file names of the wrongly classified mails
print(files[classification_is_wrong])
Output:
['mail2.txt' 'mail3.txt']
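For this to work on the real data, `files` has to line up with `y_test` after the split. One way, sketched here with mock data, is to pass the file names to `train_test_split` together with the features and labels, since it splits every array it is given with the same shuffle. The names `X`, `y`, `files_train` and `files_test` are assumptions for illustration; in the question's code the file names would probably be `temp + temp1`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# mock data standing in for the real feature matrix, labels and file names
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array([0, 1] * 10)
files = np.array([f'mail{i}.txt' for i in range(20)])

# splitting all three arrays in one call keeps them aligned row by row
X_train, X_test, y_train, y_test, files_train, files_test = train_test_split(
    X, y, files, test_size=0.33, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
predicted = model.predict(X_test)

# Boolean index of the wrong classifications, applied to the test file names
wrong_files = files_test[y_test != predicted]
print(wrong_files)
```

The same Boolean index could also be applied to `model.predict_proba(X_test)` to see how confident the model was on each misclassified email.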
Upvotes: 1