Arkham

Reputation: 69

NLTK Classifier Object

I am getting a memory error when training the classifier on the whole data set, so I divided the data set into small parts and am training an individual classifier object for each.

For testing I need a combination of these individual classifier objects. How can I do that? I can store the objects in a pickle file, but then they will still only be individual objects.

I am using NLTK.

Code :

import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features



#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

classifier = nltk.NaiveBayesClassifier.train(training_set)

Upvotes: 1

Views: 417

Answers (1)

mkaran

Reputation: 2718

The classifier needs to be trained on the whole data set (the training_set in your code) for you to be able to make correct predictions and tests (on the testing_set); training more than one classifier on parts of the dataset will not work, or at least will not be the optimal solution. I would suggest the following:

  1. Try to solve the memory error (if you are running 32-bit Python on Windows, take a look at this: http://gisgeek.blogspot.gr/2012/01/set-32bit-executable-largeaddressaware.html)
  2. Try to optimize your code/data: maybe use fewer features, or represent them in a more space/memory-efficient way.
  3. If 1 and 2 don't work and you want to combine many classifier objects into one (but only when it comes to their predictions), you could try ensemble methods, BUT I really believe this is beside the point of what you are trying to do and is not going to fix the issue you are facing. In any case, here's an example of a MaxVote classifier: https://bitbucket.org/roadrunner_team/large-scale-sentiment-analysis/src/a06d51ef42325293f0296270ca975341c847ab9f/SentimentAnalysis/FigurativeTextAnalysis/models/Classifier_.py?at=master&fileviewer=file-view-default

    from nltk import FreqDist

    class MaxVoteClassifier(object):
        """
        Takes as input a list of pre-trained classifiers and calculates
        the frequency distribution of their predictions.
        """
        def __init__(self, classifiers):
            self._classifiers = classifiers
            self.predictions = None

        def classify(self, tweet_fea):
            counts = FreqDist()
            for classifier in self._classifiers:
                # set_x_trial() and predict() are wrapper methods from the
                # linked repository; substitute your own classifiers' API.
                classifier.set_x_trial([tweet_fea])
                counts[classifier.predict()[0]] += 1
            return counts.max()
    

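Adapted to plain NLTK-style classifiers, which expose `classify(features)`, the same majority-vote idea can be sketched with `collections.Counter` (the fixed-output stand-in classifiers here are only for illustration):

```python
from collections import Counter

class SimpleMaxVote(object):
    """Majority vote over classifiers that expose classify(features)."""
    def __init__(self, classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        counts = Counter(c.classify(features) for c in self._classifiers)
        # most_common(1) returns [(label, count)] for the winning label
        return counts.most_common(1)[0][0]

# Stand-in classifiers with fixed outputs, just to show the voting.
class Fixed:
    def __init__(self, label):
        self.label = label
    def classify(self, features):
        return self.label

vote = SimpleMaxVote([Fixed("pos"), Fixed("neg"), Fixed("pos")])
print(vote.classify({"great": True}))  # → pos
```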
Upvotes: 3
