Reputation: 15
I am having an issue training my Naive Bayes classifier. I have a feature set and targets that I want to use, but I keep getting errors. I've looked at other posts from people with similar problems, but I can't figure out the issue. I'm sure there's a simple solution, but I've yet to find it.
Here's an example of the structure of the data that I'm trying to use to train the classifier.
In [1]: train[0]
Out[1]:
({u'profici': [False],
  u'saver': [False],
  u'four': [True],
  u'protest': [False],
  u'asian': [True],
  u'upsid': [False],
  ...
  u'captain': [False],
  u'payoff': [False],
  u'whose': [False]},
 0)
Where train[0] is the first tuple in the list and contains:
- a dictionary of features, with boolean values indicating the presence or absence of words in document[0]
- the target label for the binary classification of document[0]
The rest of the train list holds the features and labels for the other documents I want to classify.
When running the following code:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
MNB_clf = SklearnClassifier(MultinomialNB())
MNB_clf.train(train)
I get the error message:
TypeError: float() argument must be a string or a number
Edit:
The features are created below, from a dataframe post_sent that contains the posts in column 1 and the sentiment classification in column 2.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk import FreqDist
import itertools

stop_words = set(stopwords.words('english'))
filtered_posts = []
punc_tokenizer = RegexpTokenizer(r'\w+')

# tokenizing and removing stopwords
for post in post_sent.post:
    tokenized = [word.lower() for word in punc_tokenizer.tokenize(post)]
    filtered = [w for w in tokenized if w not in stop_words]
    filtered_posts.append(filtered)

# stemming
stemmer = PorterStemmer()
tokened_stemmed = []
for post in filtered_posts:
    stemmed = []
    for w in post:
        stemmed.append(stemmer.stem(w))
    tokened_stemmed.append(stemmed)

# frequency dist
all_words = list(itertools.chain.from_iterable(tokened_stemmed))
frequency = FreqDist(all_words)

# feature selection
word_features = list(frequency.keys())[:3000]

# IMPORTANT PART
#######################
# ------ featuresets creation ---------
def find_features(post):
    features = {}
    wrds = set(post)
    for w in word_features:
        features[w] = [w in wrds]
    return features

# zipping inputs with targets
words_and_sent = zip(tokened_stemmed, post_sent.sentiment)

# IMPORTANT PART
##########################
# feature sets created here
featuresets = [(find_features(words), sentiment)
               for words, sentiment in words_and_sent]
Upvotes: 0
Views: 488
Reputation: 15
Thanks to help from both Vivek and lenz, who explained the problem to me, I was able to reorganise my training set, and it now works. Thanks, guys!
The problem is very well explained in Vivek's answer. This is the code that reorganised the train data into the correct format:
import itertools

features_targ = []
for i in range(len(featuresets)):
    dict_test = featuresets[i]
    # flatten the one-element value lists, e.g. [False] -> False
    values = list(itertools.chain.from_iterable(dict_test[0].values()))
    # keys() and values() come back in matching order as long as the
    # dict is not modified between the two calls
    keys = list(dict_test[0].keys())
    target = dict_test[1]
    dict_ion = {}
    for j in range(len(keys)):
        dict_ion[keys[j]] = values[j]
    features_targ.append((dict_ion, target))
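As an aside, the same flattening can be written more compactly with a dict comprehension; this is just an equivalent sketch, assuming every feature value is a one-element list as produced by find_features:
# unwrap each [bool] into a plain bool, keeping the (dict, label) shape
features_targ = [({key: value[0] for key, value in feat.items()}, target)
                 for feat, target in featuresets]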
Upvotes: 1
Reputation: 36599
You are setting up the train data wrong. As @lenz said in a comment, remove the brackets around the feature dict values and use plain single values.
As given in the official documentation:
labeled_featuresets – A list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings.
But you are setting each mapping (the value of each key in the dict) to a list.
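That mismatch is most likely where your error comes from: during vectorisation, scikit-learn tries to coerce each feature value to a float, and a one-element list cannot be converted. You can reproduce the exact message:
float(False)    # fine: 0.0
float([False])  # TypeError: float() argument must be a string or a number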
Your correct train should look like:
[({u'profici': False,
   u'saver': False,
   u'four': True,
   u'protest': False,
   u'asian': True,
   u'upsid': False,
   ...
  }, 0),
 ...
 ({u'profici': True,
   u'saver': False,
   u'four': False,
   u'protest': False,
   u'asian': True,
   u'upsid': False,
   ...
  }, 1)]
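If you want to sanity-check the format, a minimal self-contained example along these lines trains without errors (the words and labels here are made up for illustration):
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# toy training set: plain booleans as values, not one-element lists
toy_train = [
    ({u'asian': True,  u'four': True,  u'saver': False}, 0),
    ({u'asian': False, u'four': True,  u'saver': True},  1),
]

clf = SklearnClassifier(MultinomialNB())
clf.train(toy_train)
print(clf.classify({u'asian': True, u'four': False, u'saver': False}))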
You can take a look at more examples here: http://www.nltk.org/howto/classify.html
Upvotes: 0