Reputation: 36
This is my first post here! I am having problems using the NLTK NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the class label and each word of the description as a feature. An example:
"My name is Obama", 001 ...
Training set = {[feature['My']=True, feature['name']=True, feature['is']=True, feature['Obama']=True], 001}
Unfortunately, with this approach the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What is wrong with my approach? Thank you!
def document_features(document):  # feature extractor
    document = set(document)
    return dict((w, True) for w in document)
...
from nltk import classify, NaiveBayesClassifier

words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while t != "":
    t = t.split("'")
    code = t[0]  # class label
    desc = t[1]  # description
    s = desc.split()  # words of the description (assumed: `s` was never defined in the post)
    words = words.union(s)  # update the vocabulary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
readfile.close()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set)  # training
Upvotes: 1
Views: 1863
Reputation: 1399
Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:
from nltk.classify import apply_features
More information and an example are here.
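To make the "acts like a list but does not store all the feature sets" idea concrete, here is a minimal pure-Python sketch of such a lazy view. It is not NLTK's implementation; the class name `LazyFeatureList` and the second sample entry are made up for illustration, and only `document_features` comes from the question.

```python
from collections.abc import Sequence


class LazyFeatureList(Sequence):
    """Hypothetical list-like view: builds each (features, label) pair on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._items = labeled_tokens  # the raw (tokens, label) pairs are still stored

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return LazyFeatureList(self._func, self._items[index])
        tokens, label = self._items[index]
        # The feature dict is created here and not cached anywhere.
        return self._func(tokens), label


def document_features(document):
    # Same feature extractor as in the question.
    return {word: True for word in set(document)}


# Illustrative data only; the second entry is invented.
entries = [(["My", "name", "is", "Obama"], "001"), (["aspirin", "tablet"], "002")]
train_view = LazyFeatureList(document_features, entries)
first_features, first_label = train_view[0]
```

Only the raw entries stay resident; each feature dictionary exists just while the consumer is looking at it.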
You are loading the whole file into memory anyway, so you will also need some form of lazy loading that reads entries only as they are needed. Consider looking into this.
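One way to load lazily is a generator that reads one line at a time. The sketch below assumes each line looks like `code'description'`, which is inferred from the `split("'")` in the question; the function names and the sample lines are hypothetical, and an in-memory `io.StringIO` stands in for the real file.

```python
import io


def iter_entries(lines):
    """Yield (words, code) pairs one line at a time (hypothetical helper)."""
    for line in lines:
        parts = line.split("'")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        code, desc = parts[0], parts[1]
        yield desc.split(), code


def iter_labeled_features(lines):
    """Yield (feature dict, code) pairs without building a full list first."""
    for words, code in iter_entries(lines):
        yield {w: True for w in words}, code


# Stand-in for open("atcname.pl"); the line format is an assumption.
sample = io.StringIO("001'My name is Obama'\n002'aspirin tablet'\n")
pairs = list(iter_labeled_features(sample))
```

With the real file you would pass the open file object instead of `sample` and consume the pairs in a loop rather than calling `list()`. Whether a given trainer accepts a one-pass iterator instead of a list is something to check against its documentation.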
Upvotes: 5