Marco

Reputation: 36

NLTK naive Bayes classifier memory issue

My first post here! I am having problems with the NLTK NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the label of the class and each word of the description as a feature. An example:

"My name is Obama", 001 ...

Training set = {[feature['My']=True, feature['name']=True, feature['is']=True, feature['Obama']=True], 001}
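In NLTK terms, that would be one (featureset, label) pair; a minimal sketch of the pair above, assuming '001' is the class code:

    sample = ({'My': True, 'name': True, 'is': True, 'Obama': True}, '001')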

Unfortunately, with this approach the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What's wrong with my approach? Thank you!

from nltk import classify
from nltk.classify import NaiveBayesClassifier

def document_features(document): # feature extractor
    document = set(document)
    return dict((w, True) for w in document)

...
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while t != "":
    t = t.split("'")
    code = t[0] # class label
    desc = t[1] # description
    s = desc.split() # tokenize the description into individual words
    words = words.union(s) # update the vocabulary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
readfile.close()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set) # training

Upvotes: 1

Views: 1863

Answers (1)

subiet

Reputation: 1399

Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:

from nltk.classify import apply_features
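To make the difference concrete, here is a minimal sketch contrasting an eager list of feature sets with the lazy view apply_features returns (the names entries, document_features, and train_length are taken from your question):

    # Eager: every feature dict is built up front and all of them stay in memory
    train_set = [(document_features(d), c) for (d, c) in entries[:train_length]]

    # Lazy: apply_features returns a list-like object that computes each
    # feature set on demand, so only one is materialized at a time
    train_set = apply_features(document_features, entries[:train_length])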

More information and an example here

You are also loading the whole file into memory anyway, so you will need some form of lazy loading that reads data on an as-needed basis. Consider looking into this
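As a minimal sketch of that idea, a generator can yield one parsed entry per line instead of building the whole entries list first (read_entries is a hypothetical helper, and the "code'description" line format is assumed from your question):

    from itertools import islice

    def read_entries(path):
        # Yield one (words, code) pair per line instead of keeping
        # every entry from the file in a list
        with open(path, 'r') as f:
            for line in f:
                parts = line.split("'")
                code, desc = parts[0], parts[1]
                yield (desc.split(), code)

    # apply_features needs a sequence it can index, so materialize only
    # the first train_length entries rather than the whole file
    entries = list(islice(read_entries("atcname.pl"), train_length))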

Upvotes: 5
