Reputation: 844
I tried to use a Naive Bayes classifier to classify my sample corpus. The sample corpus is as follows (stored in myfile.csv):
"Text";"label"
“There be no significant perinephric collection";"label1”
“There be also fluid collection”;”label2”
“No discrete epidural collection or abscess be see";"label1”
“This be highly suggestive of epidural abscess”;”label2”
“No feature of spondylodiscitis be see”;”label1”
“At the level of l2 l3 there be loculated epidural fluid collection”;”label2”
The code for the classifier is as follows:
# libraries for dataset preparation, feature engineering, model training
import pandas as pd
import csv
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
#Data preparation
data = pd.read_csv(open('myfile.csv'), sep=';', quoting=csv.QUOTE_NONE)
# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
print(X_train_counts.shape)
#From occurrences to frequencies
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)
#Training a classifier
clf = MultinomialNB().fit(X_train_tfidf, data['label'])
#Predicting with the classifier
docs_new = ['there is no spondylodiscitis', 'there is a large fluid collection']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, data['label']))
Whenever I try to run the prediction, I get the following error:
KeyError: 'label'
Where am I going wrong?
Upvotes: 0
Views: 222
Reputation: 13767
When in doubt, load your code into the REPL or a debugger and observe. Whatever is in ... is irrelevant to your issue.
import pandas as pd
import csv
...
data = pd.read_csv(open('myfile.csv'), sep=';', quoting=csv.QUOTE_NONE)
import pdb; pdb.set_trace()
...
Now we can query the data object interactively:
(Pdb) data.keys()
Index(['"Text"', '"label"'], dtype='object')
(Pdb) data['"label"']
0 "label1”
1 ”label2”
2 "label1”
3 ”label2”
4 ”label1”
5 ”label2”
Name: "label", dtype: object
(Pdb) data["label"]
*** KeyError: 'label'
Note that the keys are '"Text"' and '"label"', not "Text" and "label". So you can't use data["label"], or you'll get the KeyError that you're seeing. You have to write data['"label"'].
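To reproduce this without the original file, here is a minimal sketch that reads a small in-memory CSV. The two sample rows and the names raw, key_error_raised and labels are made up for illustration, and it assumes the file uses plain ASCII double quotes rather than the curly quotes that appear in parts of the sample. It also shows one possible workaround, stripping the quotes from the column names after loading:

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample; assumes plain ASCII double quotes throughout.
raw = (
    '"Text";"label"\n'
    '"There be also fluid collection";"label2"\n'
    '"No feature of spondylodiscitis be see";"label1"\n'
)

# QUOTE_NONE makes the parser treat quote characters as ordinary data,
# so they stay in both the header names and the cell values.
data = pd.read_csv(io.StringIO(raw), sep=';', quoting=csv.QUOTE_NONE)
print(list(data.columns))

# Looking up the unquoted name fails, as in the question.
try:
    data['label']
    key_error_raised = False
except KeyError:
    key_error_raised = True
print(key_error_raised)

# One possible workaround: strip the quotes from the column names.
data.columns = data.columns.str.strip('"')
labels = data['label']
print(labels.tolist())
```

Note that this only renames the columns. The values in the label column still carry their surrounding quotes, so they would need the same treatment before being used as class labels.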
Upvotes: 1
Reputation: 1319
If you want to be able to access the pandas column with data['label'], the first line of your file should be:
Text;label
not this:
"Text";"label"
With the quoted header you have to index your label column like this:
data['"label"']
which does not look good.
Upvotes: 0
Reputation: 96
It looks like your data has quotes. Why have you specified QUOTE_NONE there?
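As a rough illustration of this point, the sketch below reads the same kind of data with the quoting argument left at its default, so pandas treats the double quotes as quoting rather than as data. The sample rows and the names raw, columns and labels are made up, and it assumes plain ASCII double quotes; curly quotes would not be recognised as quote characters.

```python
import io

import pandas as pd

# Hypothetical two-row sample; assumes plain ASCII double quotes throughout.
raw = (
    '"Text";"label"\n'
    '"There be also fluid collection";"label2"\n'
    '"No feature of spondylodiscitis be see";"label1"\n'
)

# No quoting argument: the default handling removes the surrounding quotes
# from the header and from every field.
data = pd.read_csv(io.StringIO(raw), sep=';')

columns = list(data.columns)
labels = data['label'].tolist()
print(columns)
print(labels)
```

With the quotes removed at parse time, data['label'] works directly and the label values are clean strings.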
Upvotes: 0