Chris

Reputation: 1058

Naive Bayes in SciKit-learn training with partial_fit breaks because of different array sizes

I am trying to create a sentiment classifier for tweets using SciKit's MultinomialNB classifier. I have a data set containing 1.6 million classified tweets which I want to use to train my classifier. Since doing it all at once takes too much memory, I am trying to do it with partial_fit().

import csv
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# TEXT, SENTIMENT and ID are the column indices of the CSV (defined elsewhere)
linecounter = 0
classifier = MultinomialNB()
data_frame = DataFrame({'text': [], 'class': []})

# open csv file
with open('training.cleaned.csv', 'rb') as csvfile:

    # parse csv file
    tweet_reader = csv.reader(csvfile, delimiter=',', quotechar='"')

    #loop through each line
    for tweet in tweet_reader:
        data_frame = data_frame.append(DataFrame({'text': [tweet[TEXT].decode('utf8')], 'class': tweet[SENTIMENT]}, index=[tweet[ID]]))
        linecounter += 1

        if linecounter % 100 == 0:          
            count_vectorizer = CountVectorizer(ngram_range=([1, 2]))            
            counts = count_vectorizer.fit_transform(numpy.asarray(data_frame['text'], dtype="|S6"))         
            targets = numpy.asarray(data_frame['class'], dtype="|S6")
            classifier.partial_fit(counts, targets, numpy.asarray(['negative', 'neutral', 'positive']))

I want to train the classifier after every 100 lines (for this test). The first round goes fine, but on the second round it throws this error:

File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 443, in _count self.feature_count_ += safe_sparse_dot(Y.T, X)
ValueError: operands could not be broadcast together with shapes (3,147) (3,246) (3,147)

I know this is caused by the CountVectorizer: since every batch of 100 tweets is different, the fitted vocabulary (and therefore the number of columns in the count matrix) is different each time. I am not sure how to solve this. Is there a way to make the vectors the same size for every batch, or is there another clever trick I could use to partially train my classifier?

Upvotes: 1

Views: 2958

Answers (1)

mbatchkarov

Reputation: 16079

There are two options I can think of:

1) Use a HashingVectorizer instead of a CountVectorizer. The issue with the latter is that it learns a vocabulary when you call fit, and it does not support partial fitting (for good reasons). You can find an example of how to use a hashing vectoriser here. This is the recommended approach when you really do have too much data to fit in memory.
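As a rough sketch of how that could look with your batching loop (TEXT and SENTIMENT are the column indices from your question; non_negative=True is the older scikit-learn spelling, newer versions use alternate_sign=False instead):

import csv
import numpy
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# The hashing trick maps every batch into the same fixed-size feature space,
# so partial_fit sees matrices of identical width each time.
vectorizer = HashingVectorizer(n_features=2 ** 18, non_negative=True)
classifier = MultinomialNB()
all_classes = numpy.asarray(['negative', 'neutral', 'positive'])

texts, labels = [], []
with open('training.cleaned.csv', 'rb') as csvfile:
    tweet_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for tweet in tweet_reader:
        texts.append(tweet[TEXT].decode('utf8'))
        labels.append(tweet[SENTIMENT])
        if len(texts) == 100:
            counts = vectorizer.transform(texts)   # no fitting, just hashing
            classifier.partial_fit(counts, labels, classes=all_classes)
            texts, labels = [], []

if texts:   # train on the final, smaller batch
    classifier.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)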

2) In my opinion, a million tweets isn't that much data and you might be able to get away with using a CountVectorizer. Your code does a lot of unnecessary conversion to numpy arrays and pandas data frames, which is what is causing your memory issues. If you tidy that up you might be able to train in one go (see below). Also, think carefully about whether you really need bigram features (ngram_range=(1, 2)) or just unigrams (ngram_range=(1, 1)). Bigrams often gain you little accuracy, but the dimensionality of the matrices you have to hold in memory explodes.

with open('training.cleaned.csv', 'rb') as csvfile:
    tweet_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    rows = list(tweet_reader)   # read the file once; a reader can only be iterated a single time
data = [tweet[TEXT].decode('utf8') for tweet in rows]
targets = [tweet[SENTIMENT] for tweet in rows]
count_vectorizer = CountVectorizer(ngram_range=(1, 1))          
counts = count_vectorizer.fit_transform(data)           
classifier = MultinomialNB()    
classifier.fit(counts, targets)

Alternatively, you can manually extract the vocabulary in advance and pass it to the count vectoriser as a constructor parameter. You will then have to call transform just once and won't have to call fit at all.

with open('training.cleaned.csv', 'rb') as csvfile:
    tweet_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    data = [tweet[TEXT].decode('utf8') for tweet in tweet_reader]

voc = set(word for text in data for word in text.split())   # vocabulary of all words seen
count_vectorizer = CountVectorizer(vocabulary=voc)
counts = count_vectorizer.transform(data)
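If you still want to train in batches, the fixed vocabulary also makes partial_fit work, because every batch is now transformed into a matrix with the same number of columns. A rough sketch, again reusing the TEXT and SENTIMENT column indices from your question:

classifier = MultinomialNB()
all_classes = numpy.asarray(['negative', 'neutral', 'positive'])

batch_texts, batch_targets = [], []
with open('training.cleaned.csv', 'rb') as csvfile:
    tweet_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for tweet in tweet_reader:
        batch_texts.append(tweet[TEXT].decode('utf8'))
        batch_targets.append(tweet[SENTIMENT])
        if len(batch_texts) == 100:
            # same width every time, because the vocabulary is fixed in advance
            classifier.partial_fit(count_vectorizer.transform(batch_texts),
                                   batch_targets, classes=all_classes)
            batch_texts, batch_targets = [], []

if batch_texts:   # don't forget the last, smaller batch
    classifier.partial_fit(count_vectorizer.transform(batch_texts),
                           batch_targets, classes=all_classes)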

Upvotes: 2
