Marco

Reputation: 36

NLTK naive Bayes classifier memory issue

My first post here! I am having problems with the NLTK NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the label of the class and each word of the description as a feature. An example:

"My name is Obama", 001 ...

Training set = {[feature['My']=True, feature['name']=True, feature['is']=True, feature['Obama']=True], 001}
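In NLTK terms, that would be one (featureset, label) pair; a minimal sketch of the pair above, assuming '001' is the class code:

    sample = ({'My': True, 'name': True, 'is': True, 'Obama': True}, '001')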

Unfortunately, with this approach the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What's wrong with my approach? Thank you!

from nltk import classify
from nltk.classify import NaiveBayesClassifier

def document_features(document): # feature extractor
    document = set(document)
    return dict((w, True) for w in document)

...
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while t != "":
    t = t.split("'")
    code = t[0] # class label
    desc = t[1] # description
    s = desc.split() # tokenize the description into individual words
    words = words.union(s) # update the vocabulary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
readfile.close()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set) # training

Upvotes: 1

Views: 1863

Answers (1)

subiet

Reputation: 1399

Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:

from nltk.classify import apply_features
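To make the difference concrete, here is a minimal sketch contrasting an eager list of feature sets with the lazy view apply_features returns (the names entries, document_features, and train_length are taken from your question):

    # Eager: every feature dict is built up front and all of them stay in memory
    train_set = [(document_features(d), c) for (d, c) in entries[:train_length]]

    # Lazy: apply_features returns a list-like object that computes each
    # feature set on demand, so only one is materialized at a time
    train_set = apply_features(document_features, entries[:train_length])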

More information and an example here

You are also loading the whole file into memory anyway, so you will need some form of lazy loading that reads data on an as-needed basis. Consider looking into this
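As a minimal sketch of that idea, a generator can yield one parsed entry per line instead of building the whole entries list first (read_entries is a hypothetical helper, and the "code'description" line format is assumed from your question):

    from itertools import islice

    def read_entries(path):
        # Yield one (words, code) pair per line instead of keeping
        # every entry from the file in a list
        with open(path, 'r') as f:
            for line in f:
                parts = line.split("'")
                code, desc = parts[0], parts[1]
                yield (desc.split(), code)

    # apply_features needs a sequence it can index, so materialize only
    # the first train_length entries rather than the whole file
    entries = list(islice(read_entries("atcname.pl"), train_length))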

Upvotes: 5
