ajay_t

Reputation: 2385

How to retrain logistic regression model in sklearn with new data

How do I retrain my existing machine learning model in sklearn (Python)?

I have thousands of records with which I have trained my model and dumped it as a .pkl file using pickle. While training the model for the first time, I used the warm_start = True parameter when creating the logistic regression object.

Sample Code:

 from sklearn import linear_model

 log_regression_model = linear_model.LogisticRegression(warm_start=True)
 log_regression_model.fit(X, Y)
 # Saved this model as a .pkl file, e.g. pickle.dump(log_regression_model, open('model.pkl', 'wb'))

I want to keep this model up to date with the new data I will be getting daily. For that, I open the existing model file, get the new data from the last 24 hours, and train it again.

Sample Code:

# Open the model from the filesystem
log_regression_model = pickle.load(open('model.pkl', 'rb'))
log_regression_model.fit(X, Y)  # New X, Y here is the data from the last 24 hours only (a few hundred records)

But when I retrain the model by loading it from the filesystem, it seems to erase the existing model created with thousands of records and create a new one from just the few hundred records of the last 24 hours (the model trained on thousands of records is 3 MB on the filesystem, while the newly retrained model is only 67 KB).

I have tried using the warm_start option. How do I retrain my LogisticRegression model?

Upvotes: 6

Views: 14218

Answers (2)

Jakub Bartczuk

Reputation: 2378

When you call fit on a trained model, you basically discard all the previous information.

Scikit-learn has some models with a partial_fit method that can be used for incremental training, as described in the documentation.

I don't remember if it's possible to retrain logistic regression incrementally in sklearn, but sklearn has SGDClassifier, which with loss='log' runs logistic regression with stochastic gradient descent optimization, and it has a partial_fit method. A sketch of that approach is below.
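As a rough sketch (not from the original answer), incremental training with SGDClassifier might look like the following; X_old/y_old and X_new/y_new are hypothetical arrays standing in for the historical data and the last 24 hours of records. Note that newer scikit-learn versions spell the loss loss='log_loss' rather than loss='log'.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical data: X_old/y_old is the historical data, X_new/y_new the last 24 hours.
np.random.seed(0)
X_old, y_old = np.random.randn(1000, 3), np.random.randint(2, size=1000)
X_new, y_new = np.random.randn(100, 3), np.random.randint(2, size=100)

clf = SGDClassifier(loss='log')  # logistic regression trained with SGD
# The first call must list all classes so later batches can't introduce new labels.
clf.partial_fit(X_old, y_old, classes=np.array([0, 1]))
# Later calls update the same model instead of discarding it.
clf.partial_fit(X_new, y_new)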

Upvotes: 9

Jeremy McGibbon

Reputation: 3785

The size of the LogisticRegression object isn't tied to how many samples were used to train it.

import numpy as np
from sklearn.linear_model import LogisticRegression
import pickle
import sys

np.random.seed(0)
X, y = np.random.randn(100000, 1), np.random.randint(2, size=(100000,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))

np.random.seed(0)
X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))

results in

1230
1233

You might be saving the wrong model object. Make sure you're saving log_regression_model.

pickle.dump(log_regression_model, open('model.pkl', 'wb'))

With the model sizes so different, and given that LogisticRegression objects don't change size with the number of training samples, it looks like different code is being used to generate your saved model and this new "retrained" model.

All that said, it also looks like warm_start isn't doing anything here:

np.random.seed(0)
X, y = np.random.randn(200, 1), np.random.randint(2, size=(200,))

log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X[:100], y[:100])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X, y)
print(log_regression_model.intercept_, log_regression_model.coef_)

gives:

(array([ 0.01846266]), array([[-0.32172516]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.09707612]), array([[ 0.01501025]]))

Based on this other question, warm_start will have some effect if you use another solver (e.g. LogisticRegression(warm_start=True, solver='sag')), but it still won't be the same as re-training on the entire dataset with the new data added. For example, the above four outputs become:

(array([ 0.01915884]), array([[-0.32176053]]))
(array([ 0.17973458]), array([[ 0.33708208]]))
(array([ 0.17968324]), array([[ 0.33707362]]))
(array([ 0.09903978]), array([[ 0.01488605]]))

You can see the middle two lines are different, but not very different. All warm_start does is use the parameters of the previous model as a starting point for re-training on the new data. It sounds like what you want to do is save the data and re-train on the old and new data combined every time you add data.
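A minimal sketch of that combined-retraining approach (assuming, hypothetically, that the accumulated training data is kept in X.npy/y.npy files and that X_new/y_new hold the last 24 hours of records):

import numpy as np
import pickle
from sklearn.linear_model import LogisticRegression

# Append the new records to the stored training data (file names are assumptions).
X_all = np.concatenate([np.load('X.npy'), X_new])
y_all = np.concatenate([np.load('y.npy'), y_new])
np.save('X.npy', X_all)
np.save('y.npy', y_all)

# Re-fit from scratch on the full dataset and overwrite the saved model.
model = LogisticRegression()
model.fit(X_all, y_all)
pickle.dump(model, open('model.pkl', 'wb'))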

Upvotes: 1
