Reputation: 161
I'm working with an environment that generates new data at each iteration. I want to retain the model from the previous iteration and add the new data to that existing model.
I want to understand how fit works: will it fit the new data into the existing model, or will it create a new model from only the new data?
Calling fit with the new data:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
for i in customRange:
    new_train_data, new_train_labels, new_test_data = get_data()
    clf.fit(new_train_data, new_train_labels)  # directly fitting the new training data
    clf.predict(new_test_data)
Or is saving the history of training data and calling fit over all of the historic data the only solution?
clf = RandomForestClassifier(n_estimators=100)
global_train_data = []    # a dict has no append(); accumulate batches in lists instead
global_train_labels = []
for i in customRange:
    new_train_data, new_train_labels, new_test_data = get_data()
    global_train_data.extend(new_train_data)      # appending the new training data
    global_train_labels.extend(new_train_labels)
    clf.fit(global_train_data, global_train_labels)  # fitting on all historic data
    clf.predict(new_test_data)
My goal is to train the model efficiently, so I don't want to waste CPU time re-learning models from scratch.
I want to confirm the right approach, and also to know whether that approach is consistent across all classifiers.
Upvotes: 1
Views: 2289
Reputation: 60318
Your second approach is "correct" in the sense that, as you have already guessed, it will fit a new classifier from scratch each time the data is appended; but arguably this is not what you are looking for.
What you are actually looking for is the warm_start argument; from the docs:
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.
So, you should use your 1st approach, with the following modification:
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
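Note one subtlety: with warm_start=True, a subsequent call to fit only adds trees if n_estimators has been increased in the meantime; otherwise no new trees are fitted and scikit-learn emits a warning. A minimal sketch of the loop (reusing your get_data and customRange; the increment of 100 trees per batch is an arbitrary choice):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, warm_start=True)
for i in customRange:
    new_train_data, new_train_labels, new_test_data = get_data()
    clf.fit(new_train_data, new_train_labels)  # fits only the newly added trees on this batch
    clf.predict(new_test_data)
    clf.n_estimators += 100  # without this, the next fit() call would add no new trees

Keep in mind that the trees already in the ensemble are left untouched; only the newly added estimators see the new batch.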
This is not necessarily consistent across classifiers (some come with a partial_fit method instead); see for example Is it possible to train a sklearn model (eg SVM) incrementally? for the SGDClassifier. You should always check the relevant documentation.
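For such classifiers, the incremental loop uses partial_fit instead of fit. A minimal sketch with the SGDClassifier (reusing your get_data and customRange; the label set [0, 1] is an assumption, since partial_fit must be told all possible classes on its first call):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = np.array([0, 1])  # assumed label set; must cover every class that can ever appear
first_batch = True
for i in customRange:
    new_train_data, new_train_labels, new_test_data = get_data()
    if first_batch:
        clf.partial_fit(new_train_data, new_train_labels, classes=all_classes)
        first_batch = False
    else:
        clf.partial_fit(new_train_data, new_train_labels)
    clf.predict(new_test_data)

Unlike warm_start, partial_fit updates the existing model parameters with each batch rather than adding new estimators.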
Upvotes: 6