Sachin Hegde

Reputation: 161

Incrementally fitting sklearn RandomForestClassifier

I'm working with an environment that generates data at each iteration. I want to retain the model from the previous iteration and add the new data to the existing model.
I want to understand how fit works: will it fit the new data into the existing model, or will it create a new model from only the new data?

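Calling fit with only the new data:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
for i in customRange:
    X_train, y_train, X_test = get_data()  # new batch for this iteration
    clf.fit(X_train, y_train)  # directly fitting on the new training data only
    predictions = clf.predict(X_test)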

Or is saving the history of training data and calling fit on all of the historical data the only solution?

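import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
X_history, y_history = [], []  # all training data seen so far
for i in customRange:
    X_train, y_train, X_test = get_data()  # new batch for this iteration
    X_history.append(X_train)  # appending new training data
    y_history.append(y_train)
    # Fitting on the accumulated training data
    clf.fit(np.concatenate(X_history), np.concatenate(y_history))
    predictions = clf.predict(X_test)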

My goal is to train the model efficiently, so I don't want to waste CPU time re-learning models.

I want to confirm the right approach, and I also want to know whether that approach is consistent across all classifiers.

Upvotes: 1

Views: 2289

Answers (1)

desertnaut

Reputation: 60318

Your second approach is "correct", in the sense that, as you have already guessed, it will fit a new classifier from scratch each time the data is appended; but arguably this is not what you are looking for.
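The "from scratch" behaviour can be checked with a minimal sketch. It assumes synthetic data from make_classification as a stand-in for the generated batches, and a small forest of 10 trees purely to keep it fast; neither comes from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for two iterations of generated data (an assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)  # warm_start defaults to False
clf.fit(X[:200], y[:200])
first_trees = list(clf.estimators_)

# A second plain call to fit discards the earlier trees and builds new ones.
clf.fit(X[200:], y[200:])
second_trees = list(clf.estimators_)

# Count how many tree objects from the first fit survived the second fit.
kept = sum(any(new is old for old in first_trees) for new in second_trees)
print(len(second_trees), kept)
```

After the second call the forest still holds 10 trees, and none of them is carried over from the first fit, so nothing learned from the first batch remains.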

What you are actually looking for is the argument warm_start; from the docs:

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.

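So, you should use your 1st approach, with the following modification (and increase n_estimators before each subsequent call to fit, since a warm-started fit only trains the trees that are still missing from the ensemble):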

clf = RandomForestClassifier(n_estimators=100, warm_start=True)
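A minimal sketch of how the warm-started loop might look, assuming synthetic batches from make_classification in place of the question's get_data() and a hypothetical trees_per_batch of 50. The one detail that is easy to miss is that n_estimators has to grow before each new call to fit; otherwise no new trees are trained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the data generated at each iteration (an assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
batches = [(X[i:i + 200], y[i:i + 200]) for i in range(0, 600, 200)]

trees_per_batch = 50  # hypothetical choice, not from the original post
clf = RandomForestClassifier(
    n_estimators=trees_per_batch, warm_start=True, random_state=0
)

tree_counts = []
for step, (X_batch, y_batch) in enumerate(batches):
    if step > 0:
        # Ask for more trees; only the additional ones are trained on this batch.
        clf.set_params(n_estimators=clf.n_estimators + trees_per_batch)
    clf.fit(X_batch, y_batch)
    tree_counts.append(len(clf.estimators_))

print(tree_counts)  # [50, 100, 150]
```

Note that the existing trees are left untouched: each group of trees has only seen the batch it was trained on, so this is not equivalent to fitting one forest on all the data. It also seems safest to make sure every batch contains all the class labels, since the forest records its classes on each call to fit.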

This is not necessarily consistent across classifiers (some come with a partial_fit method instead); see, for example, "Is it possible to train a sklearn model (eg SVM) incrementally?" for the SGDClassifier. You should always check the relevant documentation.
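For comparison, a rough sketch of the partial_fit route with SGDClassifier, again assuming synthetic batches rather than the asker's data. The full set of class labels is passed on the first call, because the estimator cannot infer classes that only show up in later batches.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the data generated at each iteration (an assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
all_classes = np.unique(y)  # every label the model will ever see

clf = SGDClassifier(random_state=0)
for start in range(0, 600, 200):
    X_batch, y_batch = X[start:start + 200], y[start:start + 200]
    # Each call updates the same linear model in place with the new batch.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

predictions = clf.predict(X[:5])
```

Unlike the warm-started forest, here a single set of coefficients is updated by every batch, so earlier data keeps influencing the model without being stored.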

Upvotes: 6
