Reputation: 2385
How do I retrain my existing machine learning model in sklearn (Python)?
I have thousands of records with which I have trained my model and dumped it as a .pkl file using pickle.
While training the model for the first time, I used the warm_start=True parameter when creating the logistic regression object.
Sample Code:
from sklearn import linear_model

log_regression_model = linear_model.LogisticRegression(warm_start=True)
log_regression_model.fit(X, Y)
# Saved this model as a .pkl file on the filesystem, like:
# pickle.dump(model, open('model.pkl', 'wb'))
I want to keep this model up to date with the new data I will be getting daily. For that, I am opening the existing model file and getting the new data from the last 24 hours, then training it again.
Sample Code:
import pickle

# Open the model from the filesystem
log_regression_model = pickle.load(open('model.pkl', 'rb'))
log_regression_model.fit(X, Y)  # new X, Y here is the data from the last 24 hours only (a few hundred records)
But when I retrain the model by loading it from the filesystem, it seems to erase the existing model created with thousands of records and create a new one from just the few hundred records of the last 24 hours (the model trained on thousands of records is 3MB on the filesystem, while the new retrained model is only 67KB).
I have tried using the warm_start option. How do I retrain my LogisticRegression model?
Upvotes: 6
Views: 14218
Reputation: 2378
When you call fit on a trained model, you basically discard all the previous information.
Scikit-learn has some models with a partial_fit method that can be used for incremental training, as described in the documentation.
I don't remember if it's possible to retrain a LogisticRegression in sklearn, but sklearn has SGDClassifier, which with loss='log' runs logistic regression with stochastic gradient descent optimization, and it has a partial_fit method.
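For example, a minimal sketch of that incremental setup might look like this (the data shapes here are made up for illustration):
import numpy as np
from sklearn.linear_model import SGDClassifier

# loss='log' gives logistic regression; note that scikit-learn >= 1.3 names this loss 'log_loss'
clf = SGDClassifier(loss='log')

# Initial training: the first call to partial_fit must be given every class
# that can ever appear, since later batches may not contain all of them
X_init, y_init = np.random.randn(1000, 5), np.random.randint(2, size=1000)
clf.partial_fit(X_init, y_init, classes=np.unique(y_init))

# Daily update: refine the same model with just the new batch,
# without discarding what was learned before
X_new, y_new = np.random.randn(100, 5), np.random.randint(2, size=100)
clf.partial_fit(X_new, y_new)
After each daily partial_fit call you can pickle the classifier just like any other sklearn estimator.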
Upvotes: 9
Reputation: 3785
The size of the LogisticRegression object isn't tied to how many samples are used to train it.
from sklearn.linear_model import LogisticRegression
import numpy as np
import pickle
import sys

np.random.seed(0)
X, y = np.random.randn(100000, 1), np.random.randint(2, size=(100000,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))
np.random.seed(0)
X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))
results in
1230
1233
You might be saving the wrong model object. Make sure you're saving log_regression_model.
pickle.dump(log_regression_model, open('model.pkl', 'wb'))
With the model sizes so different, and the fact that LogisticRegression objects don't change size with the number of training samples, it looks like different code is being used to generate your saved model and this new "retrained" model.
All that said, it also looks like warm_start isn't doing anything here:
np.random.seed(0)
X, y = np.random.randn(200, 1), np.random.randint(2, size=(200,))

# warm_start=True: fit on the first half, then fit again on the second half
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X[:100], y[:100])
print(log_regression_model.intercept_, log_regression_model.coef_)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

# warm_start=False: fit on the second half only
log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

# warm_start=False: fit on the full dataset
log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X, y)
print(log_regression_model.intercept_, log_regression_model.coef_)
gives:
(array([ 0.01846266]), array([[-0.32172516]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.09707612]), array([[ 0.01501025]]))
Based on this other question, warm_start will have some effect if you use another solver (e.g. LogisticRegression(warm_start=True, solver='sag')), but it still won't be the same as re-training on the entire dataset with the new data added. For example, the above four outputs become:
(array([ 0.01915884]), array([[-0.32176053]]))
(array([ 0.17973458]), array([[ 0.33708208]]))
(array([ 0.17968324]), array([[ 0.33707362]]))
(array([ 0.09903978]), array([[ 0.01488605]]))
You can see the middle two lines are different, but not very different. All warm_start does is use the parameters from the previous fit as a starting point for re-training on the new data. It sounds like what you want to do is save the data, and re-train on the old and new data combined every time you get more.
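A rough sketch of that daily routine, assuming you also persist the training data (training_data.pkl and get_last_24h_data are hypothetical names for illustration):
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the accumulated training data and append the last 24 hours
X_old, y_old = pickle.load(open('training_data.pkl', 'rb'))
X_new, y_new = get_last_24h_data()  # hypothetical helper returning (X, y)
X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])

# Re-train from scratch on the combined dataset, then save both the
# model and the data for the next daily update
model = LogisticRegression()
model.fit(X, y)
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump((X, y), open('training_data.pkl', 'wb'))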
Upvotes: 1