How do I get a naive Bayes classifier to work?

I tried to use a Naive Bayes classifier to classify my sample corpus. The sample corpus is as follows (stored in myfile.csv):

"Text";"label"
“There be no significant perinephric collection";"label1”
“There be also fluid collection”;”label2”
“No discrete epidural collection or abscess be see";"label1”
“This be highly suggestive of epidural abscess”;”label2”
“No feature of spondylodiscitis be see”;”label1”
“At the level of l2 l3 there be loculated epidural fluid collection”;”label2”

The code for the classifier is as follows:

# libraries for dataset preparation, feature engineering, model training 
import pandas as pd
import csv
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

#Data preparation
data = pd.read_csv(open('myfile.csv'), sep=';', quoting=csv.QUOTE_NONE)

# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
print(X_train_counts.shape)

#From occurrences to frequencies
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

#Training a classifier
clf = MultinomialNB().fit(X_train_tfidf, data['label'])

#Predicting with the classifier
docs_new = ['there is no spondylodiscitis', 'there is a large fluid collection']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted): 
    print('%r => %s' % (doc, data['label']))

Whenever I try to run the prediction, I get the following error:

KeyError: 'label'

Where am I going wrong?

Upvotes: 0

Answers (3)

Matt Messersmith

Reputation: 13767

When in doubt, load your code in the REPL or a debugger. Note that whatever is elided as ... below is irrelevant to your issue.

import pandas as pd
import csv
...

data = pd.read_csv(open('myfile.csv'), sep=';', quoting=csv.QUOTE_NONE)
import pdb; pdb.set_trace()
...

Now we can query the data object interactively:

(Pdb) data.keys()
Index(['"Text"', '"label"'], dtype='object')
(Pdb) data['"label"']
0    "label1”
1    ”label2”
2    "label1”
3    ”label2”
4    ”label1”
5    ”label2”
Name: "label", dtype: object
(Pdb) data["label"]
*** KeyError: 'label'

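Note that the keys are '"Text"' and '"label"', not "Text" and "label". The quote characters are part of the column names because quoting=csv.QUOTE_NONE tells pandas not to treat them as quoting. So you can't do data["label"], or you'll get the KeyError that you're seeing. You have to write data['"label"'].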
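If you would rather keep writing data['label'], one option is to remove the quote characters when loading. The following is a minimal sketch, not a definitive fix: the two-row string stands in for myfile.csv (an assumption, so the example is self-contained), and QUOTE_CHARS is a hypothetical helper name. It strips both straight and curly quotes, since the sample in the question appears to mix them.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for myfile.csv (two rows only, straight quotes).
raw = "\n".join([
    '"Text";"label"',
    '"There be also fluid collection";"label2"',
    '"No feature of spondylodiscitis be see";"label1"',
]) + "\n"

# Same call as in the question: QUOTE_NONE keeps the quotes as literal text.
data = pd.read_csv(io.StringIO(raw), sep=';', quoting=csv.QUOTE_NONE)
print(list(data.columns))

# Option 1: strip straight and curly quotes from column names and values.
QUOTE_CHARS = '"\u201c\u201d'  # assumed set of quote characters to remove
data.columns = data.columns.str.strip(QUOTE_CHARS)
data = data.apply(lambda col: col.str.strip(QUOTE_CHARS))
print(data['label'].tolist())

# Option 2: drop QUOTE_NONE so pandas handles straight double quotes itself.
# This only helps if the file uses plain ASCII quotes consistently.
clean = pd.read_csv(io.StringIO(raw), sep=';')
print(list(clean.columns))
```

Option 1 is the more forgiving of the two if the file really contains mixed quote styles. Option 2 is simpler, but relies on the file being cleaned up first.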

Upvotes: 1