Reputation: 44
I have implemented an ML model using the naive Bayes algorithm, and I want to add incremental learning. The issue I am facing: when I train my model, preprocessing generates 1500 features. A month later, using a feedback mechanism, I want to retrain the model with new data, which might contain new features, possibly fewer or more than the original 1500. If I use fit_transform
to pick up the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # re-fitting replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X, y)
# fails because the new feature count is not equal to the previous feature count
Upvotes: 0
Views: 1526
Reputation: 16966
You could use just transform() for the CountVectorizer() and then partial_fit() for Naive Bayes, like the following, to get incremental learning. Remember, transform() extracts the same set of features that was learned from the training dataset:
X = cv.transform(corpus).toarray()  # same vocabulary as before; GaussianNB needs a dense array
classifier.partial_fit(X, y)
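A minimal end-to-end sketch of this approach; old_corpus, new_corpus, y_old, and y_new are hypothetical placeholders for your original and feedback data:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X_old = cv.fit_transform(old_corpus).toarray()  # vocabulary (e.g. 1500 features) is fixed here

classifier = GaussianNB()
# partial_fit needs the full list of classes on its first call
classifier.partial_fit(X_old, y_old, classes=np.unique(y_old))

# A month later: reuse the SAME fitted vectorizer, calling only transform()
X_new = cv.transform(new_corpus).toarray()  # same feature count; unseen words are simply dropped
classifier.partial_fit(X_new, y_new)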
But you cannot revamp the features from scratch and continue the incremental learning. The number of features needs to stay consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the old one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means we create a new model for all of the available data. You could adopt this if your dataset is small enough to keep in memory, as sketched below.
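A sketch of this retrain-from-scratch option, assuming both corpora and their label arrays fit in memory; the variable names are illustrative:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

full_corpus = old_corpus + new_corpus             # combine old and new documents
full_labels = np.concatenate([y_old, y_new])

cv = CountVectorizer()
X_full = cv.fit_transform(full_corpus).toarray()  # fresh vocabulary built from ALL the data

classifier = GaussianNB()
classifier.fit(X_full, full_labels)               # a brand-new model, not an incremental update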
Upvotes: 1
Reputation: 36609
You cannot do this with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.
Now you can use a different preprocessor (in place of CountVectorizer) which maps the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which the scikit-learn authors recommend for exactly the scenario you mentioned. While initializing it, you need to specify the number of features you want; in most cases the default value is large enough to avoid collisions between the hashes of different words, but you may experiment with different numbers. Try using that and check the performance. If it is not on par with CountVectorizer, you can do what @AI_Learning suggests and train a new model on the whole data (old + new). A minimal sketch follows.
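A sketch of the HashingVectorizer route; because n_features is fixed up front, batches from different months always land in the same feature space no matter which words they contain. The 2**10 value here is only illustrative (the library default is 2**20), and the data names are placeholders:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

# HashingVectorizer is stateless: no fit step and no stored vocabulary
hv = HashingVectorizer(n_features=2**10, alternate_sign=False)

classifier = GaussianNB()

X_old = hv.transform(old_corpus).toarray()
classifier.partial_fit(X_old, y_old, classes=np.unique(y_old))

# New words in later batches hash into the same 1024 slots
X_new = hv.transform(new_corpus).toarray()
classifier.partial_fit(X_new, y_new)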
Upvotes: 0