Poornesh V

Reputation: 171

Effective classification of natural text in Sci-kit learn/python

I want my classification algorithm to assign each piece of natural-language raw text to one of a set of categories if and only if the prediction meets a certain confidence threshold for that category (say 80%); otherwise I want the classifier to put that particular raw text into an 'unclassified' category. How do I do this?

My example data set:

+----------------------+------------+
| Details              | Category   |
+----------------------+------------+
| Any raw text1        | cat1       |
+----------------------+------------+
| any raw text2        | cat1       |
+----------------------+------------+
| any raw text5        | cat2       |
+----------------------+------------+
| any raw text7        | cat1       |
+----------------------+------------+
| any raw text8        | cat2       |
+----------------------+------------+
| Any raw text4        | cat4       |
+----------------------+------------+
| any raw text5        | cat4       |
+----------------------+------------+
| any raw text6        | cat3       |
+----------------------+------------+

This is my labelled data; I split it into a training set and a test set:

import time
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

data = pd.read_csv('mydata.xls.gold', delimiter='\t',
                   usecols=['Details', 'Category'], encoding='utf-8')
target_one = data['Category']
target_list = data['Category'].unique()
x_train, x_test, y_train, y_test = train_test_split(data.Details,
                                                    data.Category,
                                                    random_state=42)
vect = CountVectorizer(ngram_range=(1, 2))
# converting training features into a numeric matrix
X_train = vect.fit_transform(x_train.values.astype('U'))
# converting test features into a numeric matrix
X_test = vect.transform(x_test.values.astype('U'))
start = time.perf_counter()

mnb = MultinomialNB(alpha=0.13)

mnb.fit(X_train, y_train)

result = mnb.predict(X_test)

print(time.perf_counter() - start)

# mnb.predict_proba(X_test)[0:10, 1]
accuracy_score(y_test, result)

How do I proceed? Is there a parameter that needs to be set on the classifier? Thanks in advance.

Upvotes: 1

Views: 161

Answers (1)

sgDysregulation

Reputation: 4417

You can use the predict_proba result to build a pandas DataFrame with one column per category (the column order follows the classifier's classes_ attribute), then use max and idxmax to find the category with the highest probability for each element in the test set. Once that is done, you can use boolean masking and broadcasting to set every prediction whose probability is below the threshold to "unclassified".

import pandas as pd

# columns of predict_proba follow the order of mnb.classes_
df = pd.DataFrame(mnb.predict_proba(X_test), columns=mnb.classes_)
res_df = pd.DataFrame()

res_df['max_prob'] = df.max(axis=1)         # highest predicted probability per row
res_df['max_prob_cat'] = df.idxmax(axis=1)  # category achieving that probability

# anything below the 80% threshold becomes 'unclassified'
res_df.loc[res_df['max_prob'] < .8, 'max_prob_cat'] = 'unclassified'

df will look like this:

              cat1          cat2          cat3          cat4
0     1.091685e-06  2.257549e-04  9.994661e-01  3.070665e-04
1     2.288312e-02  9.752170e-01  1.783878e-03  1.159706e-04
2     1.980685e-01  3.494765e-01  4.416871e-01  1.076788e-02
3     2.205478e-07  9.999601e-01  3.276864e-05  6.920325e-06
4     2.736805e-03  9.795997e-01  1.718200e-02  4.815429e-04

res_df will look like this:

      max_prob  max_prob_cat
0     0.999466          cat3
1     0.975217          cat2
2     0.441687  unclassified
3     0.999960          cat2
4     0.979600          cat2
5     0.999956          cat2
6     0.998864          cat3
7     0.996888          cat3
8     0.999422          cat1
9     0.994412          cat3
10    0.954508          cat2
11    0.999999          cat2

Upvotes: 1
