Arkham

Reputation: 69

NLTK Classifier Object

I am getting a memory error when training the classifier on the whole data set, so I divided the data set into small parts and am training an individual classifier object for each.

For testing I need a combination of these individual classifier objects. How can I do that? I can store the objects in a pickle file, but then they will still only be individual objects.

I am using NLTK.

Code :

import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features



#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

classifier = nltk.NaiveBayesClassifier.train(training_set)

Upvotes: 1

Views: 417

Answers (1)

mkaran

Reputation: 2718

The classifier needs to be trained on the whole data set (the training_set in your code) for you to be able to make correct predictions and tests (on the testing_set); training more than one classifier on parts of the dataset will not work, or at least will not be the optimal solution. I would suggest the following:

  1. Try to solve the memory error (if you are running 32-bit Python on Windows, take a look at this: http://gisgeek.blogspot.gr/2012/01/set-32bit-executable-largeaddressaware.html)
  2. Try to optimize your code/data: maybe use fewer features, or represent them in a more space/memory-efficient way.
  3. If 1 and 2 don't work and you want to combine many classifier objects into one (but only when it comes to their predictions), you could try ensemble methods, BUT I really believe this is beside the point of what you are trying to do and is not going to fix the issue you are facing. In any case, here's an example of a MaxVote classifier: https://bitbucket.org/roadrunner_team/large-scale-sentiment-analysis/src/a06d51ef42325293f0296270ca975341c847ab9f/SentimentAnalysis/FigurativeTextAnalysis/models/Classifier_.py?at=master&fileviewer=file-view-default

    from nltk import FreqDist

    class MaxVoteClassifier(object):
        """
        Takes as input a list of pre-trained classifiers and calculates
        the frequency distribution of their predictions.
        """
        def __init__(self, classifiers):
            self._classifiers = classifiers
            self.predictions = None

        def classify(self, tweet_fea):
            counts = FreqDist()
            for classifier in self._classifiers:
                # set_x_trial() and predict() are wrapper methods from the
                # linked repository; substitute your own classifiers' API.
                classifier.set_x_trial([tweet_fea])
                counts[classifier.predict()[0]] += 1
            return counts.max()
    

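Adapted to plain NLTK-style classifiers, which expose `classify(features)`, the same majority-vote idea can be sketched with `collections.Counter` (the fixed-output stand-in classifiers here are only for illustration):

```python
from collections import Counter

class SimpleMaxVote(object):
    """Majority vote over classifiers that expose classify(features)."""
    def __init__(self, classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        counts = Counter(c.classify(features) for c in self._classifiers)
        # most_common(1) returns [(label, count)] for the winning label
        return counts.most_common(1)[0][0]

# Stand-in classifiers with fixed outputs, just to show the voting.
class Fixed:
    def __init__(self, label):
        self.label = label
    def classify(self, features):
        return self.label

vote = SimpleMaxVote([Fixed("pos"), Fixed("neg"), Fixed("pos")])
print(vote.classify({"great": True}))  # → pos
```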
Upvotes: 3
