Reputation: 171
I want my classifier to assign raw natural-language text to one of a set of categories if and only if the prediction meets a certain accuracy threshold for that category (say, 80%). Otherwise, I want it to assign that text to an 'unclassified' category. How do I do this?
My example data set:
+----------------------+------------+
| Details | Category |
+----------------------+------------+
| Any raw text1 | cat1 |
+----------------------+------------+
| any raw text2 | cat1 |
+----------------------+------------+
| any raw text5 | cat2 |
+----------------------+------------+
| any raw text7 | cat1 |
+----------------------+------------+
| any raw text8 | cat2 |
+----------------------+------------+
| Any raw text4 | cat4 |
+----------------------+------------+
| any raw text5 | cat4 |
+----------------------+------------+
| any raw text6 | cat3 |
+----------------------+------------+
This would be my training data; I'll split the same data into a train set and a test set.
import time
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# load the tab-separated file, keeping only the text and label columns
data = pd.read_csv('mydata.xls.gold', delimiter='\t',
                   usecols=['Details', 'Category'], encoding='utf-8')
target_one = data['Category']
target_list = data['Category'].unique()
# only 'Details' and 'Category' are loaded, so the labels come from data.Category
x_train, x_test, y_train, y_test = train_test_split(data.Details,
                                                    data.Category, random_state=42)
vect = CountVectorizer(ngram_range=(1, 2))
# converting training text into a numeric document-term matrix
X_train = vect.fit_transform(x_train.values.astype('U'))
# converting test text with the vocabulary learned from the training set
X_test = vect.transform(x_test.values.astype('U'))
# time.clock() was removed in Python 3.8; perf_counter() replaces it
start = time.perf_counter()
mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)
result = mnb.predict(X_test)
print(time.perf_counter() - start)
# mnb.predict_proba(X_test)[0:10, 1]
accuracy_score(y_test, result)
How do I proceed? Is there a parameter that needs to be set on the classifier? Thanks in advance.
Upvotes: 1
Views: 161
Reputation: 4417
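You can take the predict_proba result and build a pandas DataFrame from it, with one column per class. Then use max and idxmax to find the highest probability and the corresponding category for each element in the test set. Once that is done, you can use boolean masking to set the category to "unclassified" wherever the highest probability is below the threshold. Take the column names from the classifier's classes_ attribute rather than from target_list: predict_proba orders its columns by classes_ (sorted labels), while unique() returns labels in order of appearance.
import pandas as pd
# mnb is the fitted classifier from the question
df = pd.DataFrame(mnb.predict_proba(X_test), columns=mnb.classes_)
res_df = pd.DataFrame()
# highest class probability per row, and the class it belongs to
res_df['max_prob'] = df.max(axis=1)
res_df['max_prob_cat'] = df.idxmax(axis=1)
# rows whose best probability is below 0.8 become 'unclassified'
res_df.loc[res_df['max_prob'] < .8, 'max_prob_cat'] = 'unclassified'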
df will look like this (first rows shown):
cat1 cat2 cat3 cat4
0 1.091685e-06 2.257549e-04 9.994661e-01 3.070665e-04
1 2.288312e-02 9.752170e-01 1.783878e-03 1.159706e-04
2 1.980685e-01 3.494765e-01 4.416871e-01 1.076788e-02
3 2.205478e-07 9.999601e-01 3.276864e-05 6.920325e-06
4 2.736805e-03 9.795997e-01 1.718200e-02 4.815429e-04
res_df will look like this:
max_prob max_prob_cat
0 0.999466 cat3
1 0.975217 cat2
2 0.441687 unclassified
3 0.999960 cat2
4 0.979600 cat2
5 0.999956 cat2
6 0.998864 cat3
7 0.996888 cat3
8 0.999422 cat1
9 0.994412 cat3
10 0.954508 cat2
11 0.999999 cat2
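The same idea can be wrapped in a small helper that works directly on the NumPy arrays. This is only a minimal sketch: the function name predict_with_threshold, its fallback argument, and the toy corpus are made up for illustration and are not part of scikit-learn or of the question's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def predict_with_threshold(clf, X, threshold=0.8, fallback="unclassified"):
    """Hypothetical helper: return the most probable class per row, or
    `fallback` when the top probability is below `threshold`."""
    proba = clf.predict_proba(X)        # shape (n_samples, n_classes)
    best = proba.argmax(axis=1)         # column index of the top class
    # predict_proba columns follow clf.classes_; cast to object so the
    # fallback string is not truncated to the width of the class labels
    labels = clf.classes_[best].astype(object)
    labels[proba.max(axis=1) < threshold] = fallback
    return labels


# Toy corpus, invented purely for illustration
train_text = ["cheap flights to paris", "book hotel and flights",
              "python pandas dataframe", "numpy array in python"]
train_label = ["travel", "travel", "code", "code"]

vect = CountVectorizer()
clf = MultinomialNB().fit(vect.fit_transform(train_text), train_label)

# The first text is clearly about travel; the second mixes both topics
test_text = ["flights and hotel in paris", "python flights"]
pred = predict_with_threshold(clf, vect.transform(test_text))
print(pred)
```

Note that the threshold is applied to the model's estimated probability, which is not the same thing as accuracy, so it is worth checking on the test set how accurate the predictions above the threshold actually are.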
Upvotes: 1