I'm learning the Naive Bayes classifier using NLTK.
In section 1.3 ("Document Classification") of the NLTK book (http://www.nltk.org/book/ch06.html), there is a featureset example:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
So each featureset in the example has the form ({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg').
But I want to change the feature values from 'contains(waste)': False to 'contains(waste)': 2. I think that form describes a document better, because it captures the frequency of each word. The featureset would then look like ({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg').
But I'm worried that 'contains(waste)': 2 and 'contains(waste)': 1 are totally different values to the Naive Bayes classifier, so it can't capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.
For example, {'contains(lot)': 1, 'contains(waste)': 1} and {'contains(lot)': 2, 'contains(waste)': 1} might be entirely unrelated as far as the program is concerned.
Can nltk's NaiveBayesClassifier understand word frequency?
This is the code I used:
def split_and_count_word(data):
    #belongs_to : Main
    #Role : make featuresets from Korean words using konlpy.
    #Parameter : dictionary data (dict of contents, ex. {'politic': {'parliament': [content, content]}, ...})
    #Return : list of featuresets ([({'word': True, ...}, 'politic')] == featureset + category)
    featuresets = []
    twitter = konlpy.tag.Twitter()  # Korean word splitter
    for big_cat in data:
        for small_cat in data[big_cat]:
            # save the category name needed in the featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)
            for one_news in data[big_cat][small_cat]:
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                # one_news is a list in a list, so open it!
                doc = one_news
                # split words using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])  # delete useless sentences
                # keep only words longer than one character
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                # save
                featuresets.append((dict_of_featuresets, category))
    return featuresets

def make_featuresets(data):
    #belongs_to : split_and_count_word
    #Role : make featuresets
    #Parameter : list list_of_up_two_word (ex. ['비누', '떨어', '지다'])
    #Return : dictionary {word: True for word in data}
    #PROBLEM :(
    #cannot consider the frequency of a word
    return {word: True for word in data}

def naive_train(featuresets):
    #belongs_to : Main
    #Role : learning by the naive Bayes rule
    #Parameter : list of featuresets ([({'word': True, ...}, 'pol/pal')])
    #Return : object classifier (nltk NaiveBayesClassifier object),
    #         list test_set (the featuresets that are randomly selected)
    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)
    return classifier, test_set

featuresets = split_and_count_word(data)
classifier, test_set = naive_train(featuresets)
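For reference, the frequency-recording variant of make_featuresets that I have in mind would be something like this sketch (the function name is mine; it uses collections.Counter):

    from collections import Counter

    def make_featuresets_with_counts(words):
        # Record how often each word occurs, instead of only True for presence.
        return dict(Counter(words))

    make_featuresets_with_counts(['waste', 'waste', 'lot'])
    # → {'waste': 2, 'lot': 1}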
The nltk's Naive Bayes classifier treats feature values as logically distinct. Values are not limited to True and False, but they are never treated as quantities. If you have features f=2 and f=3, they count as distinct values. The only way to add quantity to such a model is to sort them into "buckets" like f=1, f="few" (2-5), f="several" (6-10), f="many" (11 or more), for example. (Note: If you go this route, there are algorithms for choosing good value ranges for the buckets.) And even then the model does not "know" that "few" is between "one" and "several". You'll need a different machine learning tool to handle quantity directly.