Reputation: 17647
Does CountVectorizer support partial fit?
I would like to train the CountVectorizer using different batches of data.
Upvotes: 10
Views: 1110
Reputation: 81
The implementation by sajiad is correct, and I'm grateful to them for sharing their solution. It could be made more flexible by changing the call to hasattr() to reference self instead of vectorizer, so that the method no longer depends on a global variable with that name.
I've implemented this with a short reproducible example below, illustrating the role of partial_fit() compared to fit():
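from sklearn.feature_extraction.text import CountVectorizer

def partial_fit(self, data):
    # Keep the vocabulary learned so far (empty on the first call).
    if hasattr(self, 'vocabulary_'):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit() replaces vocabulary_ with the terms of the current batch only.
    self.fit(data)
    # Merge old and new terms; indices are reassigned in arbitrary set order.
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_)))
    self.vocabulary_ = {vocab[i]: i for i in range(len(vocab))}

CountVectorizer.partial_fit = partial_fit
vectorizer = CountVectorizer()
corpus = ['The quick brown fox',
          'jumps over the lazy dog']

# Without partial fit: each fit() call discards the previous vocabulary
for i in corpus:
    vectorizer.fit([i])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vectorizer.get_feature_names())
['dog', 'jumps', 'lazy', 'over', 'the']

# With partial fit: the vocabulary accumulates across batches
for i in corpus:
    vectorizer.partial_fit([i])
print(vectorizer.get_feature_names())
['over', 'fox', 'lazy', 'quick', 'the', 'jumps', 'dog', 'brown']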
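One caveat with merging through a set is that every call may reshuffle the column indices, so matrices produced by transform() at different times need not line up. The following is a minimal sketch of one way around that, assuming you want earlier indices to stay fixed and new terms appended at the end. The class name IncrementalCountVectorizer is hypothetical and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer


class IncrementalCountVectorizer(CountVectorizer):
    """Hypothetical subclass for illustration; not part of scikit-learn."""

    def partial_fit(self, raw_documents, y=None):
        # Copy the vocabulary learned so far (empty on the first call).
        vocab = dict(getattr(self, "vocabulary_", {}))
        # fit() replaces vocabulary_ with the terms of this batch only.
        self.fit(raw_documents)
        # Terms not seen in earlier batches, in a deterministic order.
        new_terms = sorted(set(self.vocabulary_) - set(vocab))
        # Existing terms keep their index; new terms go at the end.
        for term in new_terms:
            vocab[term] = len(vocab)
        self.vocabulary_ = vocab
        return self


vec = IncrementalCountVectorizer()
vec.partial_fit(["The quick brown fox"])
first_vocab = dict(vec.vocabulary_)
vec.partial_fit(["jumps over the lazy dog"])

# Indices assigned after the first batch are unchanged after the second.
stable = all(vec.vocabulary_[t] == i for t, i in first_vocab.items())
row = vec.transform(["the fox"]).toarray()
print(stable, row.shape)
```

Because only new columns are appended, a matrix built after an earlier batch can be padded with zero columns on the right to match a later one. Note that options such as min_df, max_df and max_features would still be applied per batch by fit(), which may not be what you want.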
Upvotes: 0
Reputation: 379
No, it does not support partial fit.
But you can write a simple method to accomplish your goal:
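from sklearn.feature_extraction.text import CountVectorizer

def partial_fit(self, data):
    # Note: this checks the global name `vectorizer`, not `self`.
    if hasattr(vectorizer, 'vocabulary_'):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit() learns the vocabulary of the current batch only.
    self.fit(data)
    # Union of the previous and the new terms, re-indexed from 0.
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_)))
    self.vocabulary_ = {vocab[i]: i for i in range(len(vocab))}

CountVectorizer.partial_fit = partial_fit

# `l` (a stop-word list) and `df` (a DataFrame) are not defined in this snippet.
vectorizer = CountVectorizer(stop_words=l)
vectorizer.fit(df[15].values[0:100])
vectorizer.partial_fit(df[15].values[100:200])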
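Since df and l are not shown, here is a self-contained sketch of the same batching pattern that can be run as-is. The list docs is a hypothetical stand-in for df[15].values, and scikit-learn's built-in 'english' stop-word list replaces l. As a small deviation, the merged terms are sorted so the column order is reproducible between runs.

```python
from sklearn.feature_extraction.text import CountVectorizer


def partial_fit(self, data):
    # Terms learned before this batch (empty on the first call).
    previous = set(getattr(self, "vocabulary_", {}))
    # fit() overwrites vocabulary_ with this batch's terms.
    self.fit(data)
    # Merge and re-index in sorted order for reproducibility.
    merged = sorted(previous | set(self.vocabulary_))
    self.vocabulary_ = {term: i for i, term in enumerate(merged)}


CountVectorizer.partial_fit = partial_fit

# Hypothetical stand-in for df[15].values
docs = [
    "red apples and green apples",
    "green pears",
    "ripe yellow bananas",
    "red cherries",
]

vectorizer = CountVectorizer(stop_words="english")
batch_size = 2
for start in range(0, len(docs), batch_size):
    vectorizer.partial_fit(docs[start:start + batch_size])

# transform() uses the merged vocabulary, so terms from both batches get a column.
X = vectorizer.transform(docs)
print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

The number of columns in X equals the size of the merged vocabulary, which is the check to run on your own data after the last batch.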
Upvotes: 2