Donbeo

Reputation: 17647

sklearn partial fit of CountVectorizer

Does CountVectorizer support partial fit?

I would like to train the CountVectorizer using different batches of data.

Upvotes: 10

Views: 1110

Answers (2)

tomlincr

Reputation: 81

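The implementation by sajjad is correct, and I'm grateful to them for sharing their solution. It could be made more flexible by changing the hasattr() call to reference self instead of vectorizer, so that the method works for any CountVectorizer instance and not only a global variable named vectorizer.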

I've implemented this with a short reproducible example below illustrating the role of partial_fit() compared to fit():

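def partial_fit(self, data):
    # Keep the vocabulary learned from earlier batches, if there is one.
    if hasattr(self, 'vocabulary_'):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit() replaces vocabulary_ with the terms of the current batch only.
    self.fit(data)
    # Merge old and new terms and reassign the column indices.
    # Set ordering is arbitrary, so the indices are not alphabetical.
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_)))
    self.vocabulary_ = {vocab[i]: i for i in range(len(vocab))}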

from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer.partial_fit = partial_fit

vectorizer = CountVectorizer()

corpus = ['The quick brown fox',
'jumps over the lazy dog']

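# Without partial fit: each fit() call discards the previous vocabulary
for i in corpus:
    vectorizer.fit([i])

# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
print(vectorizer.get_feature_names())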

['dog', 'jumps', 'lazy', 'over', 'the']

# With partial fit
for i in corpus:
    vectorizer.partial_fit([i])

print(vectorizer.get_feature_names())

['over', 'fox', 'lazy', 'quick', 'the', 'jumps', 'dog', 'brown']
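Because the merged terms pass through a set, the column order in the output above can differ between runs. As a minimal sketch (not part of either answer), the same idea can be written as a subclass that sorts the merged terms so the indices are reproducible. The class name IncrementalCountVectorizer is hypothetical, and this assumes a scikit-learn version where transform() only needs vocabulary_ to be set:

```python
from sklearn.feature_extraction.text import CountVectorizer


class IncrementalCountVectorizer(CountVectorizer):
    """Hypothetical subclass: CountVectorizer with a vocabulary-merging partial_fit."""

    def partial_fit(self, raw_documents):
        # Remember the terms learned from earlier batches, if any.
        previous_terms = set(getattr(self, "vocabulary_", {}))
        # fit() discards the old vocabulary and learns only this batch.
        self.fit(raw_documents)
        # Merge old and new terms; sorting makes the column indices reproducible.
        merged_terms = sorted(previous_terms | set(self.vocabulary_))
        self.vocabulary_ = {term: index for index, term in enumerate(merged_terms)}
        return self


vectorizer = IncrementalCountVectorizer()
for batch in (["The quick brown fox"], ["jumps over the lazy dog"]):
    vectorizer.partial_fit(batch)

# List the terms in column order.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)

# Transform a new document using the merged vocabulary.
counts = vectorizer.transform(["the quick dog"])
print(counts.toarray())
```

One caveat with this approach: options such as min_df and max_df are applied inside fit() to each batch separately, so they do not act on document frequencies over the whole corpus.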

Upvotes: 0

sajjad

Reputation: 379

No, it does not support partial fit.

But you can write a simple method to accomplish your goal:

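def partial_fit(self, data):
    # Note: this checks the global `vectorizer` created further down,
    # so the method only behaves as intended when called on that instance.
    if hasattr(vectorizer, 'vocabulary_'):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit() replaces vocabulary_ with the terms of the current batch only.
    self.fit(data)
    # Merge old and new terms and reassign the column indices.
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_)))
    self.vocabulary_ = {vocab[i]: i for i in range(len(vocab))}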

from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer.partial_fit = partial_fit

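# `l` and `df` come from the answerer's own code and are not defined here
# (presumably a stop-word list and a pandas DataFrame).
vectorizer = CountVectorizer(stop_words=l)
vectorizer.fit(df[15].values[0:100])
vectorizer.partial_fit(df[15].values[100:200])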
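If patching the class is not an option, a different route is to collect the terms batch by batch and only then build a vectorizer with a fixed vocabulary. This is a rough sketch rather than the method from this answer: it relies on build_analyzer() and the vocabulary parameter of CountVectorizer, and the batches list is a hypothetical stand-in for slices such as df[15].values[0:100]:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical batches standing in for slices of a larger dataset.
batches = [
    ["The quick brown fox"],
    ["jumps over the lazy dog"],
]

# The analyzer applies the same lowercasing and tokenization that fit() would.
analyzer = CountVectorizer().build_analyzer()

# Pass 1: stream over the batches and collect every term seen.
seen_terms = set()
for batch in batches:
    for document in batch:
        seen_terms.update(analyzer(document))

# Pass 2: a vectorizer given a fixed vocabulary can transform without fitting.
feature_names = sorted(seen_terms)
vectorizer = CountVectorizer(vocabulary=feature_names)
matrix = vectorizer.transform(batches[1])
print(feature_names)
print(matrix.toarray())
```

Only the set of terms has to stay in memory during the first pass, so each batch can be loaded and discarded in turn. Any options such as stop_words would need to be passed to both CountVectorizer calls so that the analyzer and the final vectorizer agree.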

Upvotes: 2
