Reputation: 55
I'm using scikit-learn, where I've saved a logistic regression model with unigrams as features, trained on training set 1. Is it possible to load this model and then extend it with new data instances from a second training set (training set 2)? If so, how can this be done? The reason for doing this is that I'm using a different approach for each of the training sets (the first approach involves feature corruption/regularization, and the second involves self-training).
I've added some simple example code for clarity:
from sklearn.linear_model import LogisticRegression as log
from sklearn.feature_extraction.text import CountVectorizer as cv
import pickle
trainText1 = [...]   # Training set 1 text instances
trainLabel1 = [...]  # Training set 1 labels
trainText2 = [...]   # Training set 2 text instances
trainLabel2 = [...]  # Training set 2 labels
clf = log()
# Count vectorizer used by the logistic regression classifier
vec = cv()
# Fit count vectorizer with training text data from training set 1
vec.fit(trainText1)
# Transform text into count vectors for training set 1
trainVec1 = vec.transform(trainText1)
# Fit the logistic regression classifier on the vectorized training set 1
clf.fit(trainVec1, trainLabel1)
# Saving logistic regression model from training set 1
modelFileSave = open('modelFromTrainingSet1', 'wb')
pickle.dump(clf, modelFileSave)
modelFileSave.close()
# Loading logistic regression model from training set 1
modelFileLoad = open('modelFromTrainingSet1', 'rb')
clf = pickle.load(modelFileLoad)
# I'm unsure how to continue from here....
Upvotes: 2
Views: 2750
Reputation: 40169
LogisticRegression internally uses the liblinear solver, which does not support incremental fitting. Instead you could use SGDClassifier(loss='log'), which has a partial_fit method that can be used for this, although in practice its hyperparameters are different; be careful to grid search their optimal values. Read the SGDClassifier documentation for the meaning of those hyperparameters.
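A minimal sketch of what that could look like (the toy texts and labels below are placeholders standing in for your two training sets; note that recent scikit-learn releases spell the loss 'log_loss'):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder data standing in for your two training sets
trainText1 = ["good movie", "bad movie"]
trainLabel1 = [1, 0]
trainText2 = ["great film", "awful film"]
trainLabel2 = [1, 0]

vec = CountVectorizer()
trainVec1 = vec.fit_transform(trainText1)

# loss='log' makes SGDClassifier a logistic regression model
clf = SGDClassifier(loss='log')

# The first call to partial_fit must be given every class that will
# ever appear, since later batches may not contain all labels
clf.partial_fit(trainVec1, trainLabel1,
                classes=np.unique(trainLabel1 + trainLabel2))

# You can pickle/unpickle clf here exactly as in your code, then
# continue training on the second set with the same vectorizer
trainVec2 = vec.transform(trainText2)
clf.partial_fit(trainVec2, trainLabel2)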
CountVectorizer does not support incremental fitting either. You would have to reuse the vectorizer fitted on training set #1 to transform set #2, which means that any token from set #2 not already seen in set #1 will be silently ignored, as the short example below shows. This might not be what you expect.
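A tiny illustration of that unseen-token behaviour:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["apple banana"])             # vocabulary learned from set #1 only
X2 = vec.transform(["apple cherry"])  # 'cherry' was never seen
print(X2.toarray())                   # [[1 0]] -- 'cherry' is dropped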
To mitigate this you can use HashingVectorizer, which is stateless, at the cost of no longer knowing what the features mean. Read the documentation for more details.
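A rough sketch of that alternative (reusing the placeholder trainText1/trainText2 from the example above; n_features is shown explicitly here, but the default works too):
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step: tokens are hashed into a fixed-size feature space, so
# both sets can be transformed independently yet consistently
hv = HashingVectorizer(n_features=2**18)
trainVec1 = hv.transform(trainText1)
trainVec2 = hv.transform(trainText2)
# Both matrices share the same columns and can feed successive
# partial_fit calls; the trade-off is that a column can no longer be
# traced back to the word(s) that hashed into it.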
Upvotes: 4