Reputation: 55
I'm using scikit-learn, where I've saved a logistic regression model with unigrams as features, trained on training set 1. Is it possible to load this model and then extend it with new data instances from a second training set (training set 2)? If so, how can this be done? The reason for doing this is that I'm using a different approach for each of the training sets (the first approach involves feature corruption/regularization, and the second involves self-training).
I've added some simple example code for clarity:
from sklearn.linear_model import LogisticRegression as log
from sklearn.feature_extraction.text import CountVectorizer as cv
import pickle
trainText1 = [...]   # Training set 1 text instances
trainLabel1 = [...]  # Training set 1 labels
trainText2 = [...]   # Training set 2 text instances
trainLabel2 = [...]  # Training set 2 labels
clf = log()
# Count vectorizer used by the logistic regression classifier
vec = cv()
# Fit count vectorizer with training text data from training set 1
vec.fit(trainText1)
# Transform text into count vectors for training set 1
trainVec1 = vec.transform(trainText1)
# Fit the logistic regression classifier on the vectorized training set 1
clf.fit(trainVec1, trainLabel1)
# Saving logistic regression model from training set 1
modelFileSave = open('modelFromTrainingSet1', 'wb')
pickle.dump(clf, modelFileSave)
modelFileSave.close()
# Loading logistic regression model from training set 1
modelFileLoad = open('modelFromTrainingSet1', 'rb')
clf = pickle.load(modelFileLoad)
# I'm unsure how to continue from here....
Upvotes: 2
Views: 2750
Reputation: 40169
LogisticRegression internally uses the liblinear solver, which does not support incremental fitting. Instead you could use SGDClassifier(loss='log'), which has a partial_fit method that can be used for this, although in practice its hyperparameters are different; be careful to grid search their optimal values. Read the SGDClassifier documentation for the meaning of those hyperparameters.
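A minimal sketch of what that could look like (the toy texts and labels below are placeholders standing in for your two training sets; note that recent scikit-learn releases spell the loss 'log_loss'):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder data standing in for your two training sets
trainText1 = ["good movie", "bad movie"]
trainLabel1 = [1, 0]
trainText2 = ["great film", "awful film"]
trainLabel2 = [1, 0]

vec = CountVectorizer()
trainVec1 = vec.fit_transform(trainText1)

# loss='log' makes SGDClassifier a logistic regression model
clf = SGDClassifier(loss='log')

# The first call to partial_fit must be given every class that will
# ever appear, since later batches may not contain all labels
clf.partial_fit(trainVec1, trainLabel1,
                classes=np.unique(trainLabel1 + trainLabel2))

# You can pickle/unpickle clf here exactly as in your code, then
# continue training on the second set with the same vectorizer
trainVec2 = vec.transform(trainText2)
clf.partial_fit(trainVec2, trainLabel2)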
CountVectorizer does not support incremental fitting either. You would have to reuse the vectorizer fitted on training set #1 to transform set #2, which means that any token from set #2 not already seen in set #1 will be silently ignored, as the short example below shows. This might not be what you expect.
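A tiny illustration of that unseen-token behaviour:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["apple banana"])             # vocabulary learned from set #1 only
X2 = vec.transform(["apple cherry"])  # 'cherry' was never seen
print(X2.toarray())                   # [[1 0]] -- 'cherry' is dropped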
To mitigate this you can use HashingVectorizer, which is stateless, at the cost of no longer knowing what the features mean. Read the documentation for more details.
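A rough sketch of that alternative (reusing the placeholder trainText1/trainText2 from the example above; n_features is shown explicitly here, but the default works too):
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step: tokens are hashed into a fixed-size feature space, so
# both sets can be transformed independently yet consistently
hv = HashingVectorizer(n_features=2**18)
trainVec1 = hv.transform(trainText1)
trainVec2 = hv.transform(trainText2)
# Both matrices share the same columns and can feed successive
# partial_fit calls; the trade-off is that a column can no longer be
# traced back to the word(s) that hashed into it.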
Upvotes: 4