Arkita

Reputation: 44

How to implement incremental learning using naive bayes algorithm in python?

I have implemented an ML model using the naive Bayes algorithm, and I want to add incremental learning. The issue I am facing: when I first train my model, preprocessing generates 1500 features. A month later, when I want to retrain the model with new data collected through a feedback mechanism, that data may contain new features (fewer or more than the previous 1500). If I use fit_transform to extract the new features, my existing feature set gets lost.

I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # replaces my older feature set

classifier = GaussianNB()
classifier.partial_fit(X, y)
# fails because the new feature count is not equal to the previous feature count

Upvotes: 0

Views: 1526

Answers (2)

Venkatachalam

Reputation: 16966

You could call just transform() on the CountVectorizer() and then partial_fit() on the Naive Bayes model, like the following, for incremental learning. Remember, transform() extracts the same set of features that you learned from the original training dataset.

X = cv.transform(corpus)
    
classifier.partial_fit(X,y)
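A fuller runnable sketch of this idea (the corpus, labels, and timing are made-up placeholders for illustration): fit the vectorizer once, then only transform() new batches so the feature count never changes. Note that the first partial_fit() call on GaussianNB must be given the full list of classes, and GaussianNB needs dense arrays.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Initial training: fit the vectorizer once on the original corpus.
corpus = ["good product", "bad service", "great quality", "poor support"]
y = np.array([1, 0, 1, 0])

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # GaussianNB needs a dense array

clf = GaussianNB()
# The first partial_fit call must list every class that will ever appear.
clf.partial_fit(X, y, classes=np.array([0, 1]))

# A month later: only transform() the feedback data, never fit_transform(),
# so the feature space (and its size) stays identical.
new_corpus = ["good service", "bad quality"]
new_y = np.array([1, 0])
X_new = cv.transform(new_corpus).toarray()  # same number of columns as X
clf.partial_fit(X_new, new_y)

pred = clf.predict(cv.transform(["good support"]).toarray())
```

Words in the new batch that were not in the original vocabulary are silently dropped by transform(), which is exactly why the column count stays fixed.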

But you cannot rebuild the features from scratch and continue the incremental learning. In other words, the number of features needs to stay consistent for any model to learn incrementally.

If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means building a new model over all the available data. You could adopt this approach if your dataset is small enough to keep in memory!
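The full-retrain fallback can be sketched like this (corpora and labels are invented for illustration): concatenate old and new data, refit the vocabulary, and train a fresh model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

old_corpus = ["good product", "bad service"]
old_y = [1, 0]
new_corpus = ["slow delivery", "excellent packaging"]  # brings new words/features
new_y = [0, 1]

# Rebuild the vocabulary over everything and train a fresh model;
# the old model and its 1500-feature space are discarded.
cv = CountVectorizer()
X_all = cv.fit_transform(old_corpus + new_corpus).toarray()
clf = GaussianNB().fit(X_all, old_y + new_y)
```

The cost is a full pass over the historical data, which is why this only works while everything still fits in memory.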

Upvotes: 1

Vivek Kumar

Reputation: 36609

You cannot do this with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.

Instead, you can use a different preprocessor (in place of CountVectorizer) that maps inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which the scikit-learn authors recommend for exactly the scenario you describe. When initializing it, you need to specify the number of features you want. In most cases the default value is large enough to avoid collisions between the hashes of different words, but you can experiment with different sizes. Try it and check the performance. If it is not on par with CountVectorizer, you can do what @AI_Learning suggests and build a new model on the whole data (old + new).
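A minimal sketch of the HashingVectorizer approach (toy texts and a deliberately small n_features for illustration; the default is 2**20): because the hash fixes the dimensionality up front, any future batch, including ones with previously unseen words, maps into the same number of columns.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

# n_features fixes the dimensionality forever; alternate_sign=False keeps
# the values non-negative, i.e. closer to raw counts.
hv = HashingVectorizer(n_features=2**10, alternate_sign=False)

clf = GaussianNB()
X = hv.transform(["good product", "bad service"]).toarray()
clf.partial_fit(X, [1, 0], classes=[0, 1])

# Later batches with unseen words still hash into the same 2**10 columns,
# so partial_fit keeps working without refitting anything.
X_new = hv.transform(["terrible delivery", "wonderful quality"]).toarray()
clf.partial_fit(X_new, [0, 1])
```

The trade-off is that hashing is one-way: you lose the ability to map a column index back to a word, and distinct words can collide into the same column if n_features is too small.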

Upvotes: 0
