GreenGodot

Reputation: 6773

Out-Of-Core learning for Sklearn pipelines

I'm a newbie doing some work in scikit-learn with SGDClassifier, classifying one-sentence texts into labels (think ham/spam emails, for example). Here is my pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# stopset is my own set of stop words
# (note: n_iter was renamed max_iter in newer scikit-learn releases)
clf = SGDClassifier(fit_intercept=True, loss='modified_huber', alpha=.0001, shuffle=True,
                    n_iter=15, n_jobs=-1, penalty='elasticnet')
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), lowercase=True,
                             stop_words=stopset, use_idf=True, norm='l2')
pipeline = Pipeline([
    ('mapper', vectorizer),
    ('clf', clf),
])

I am familiar with using partial_fit to avoid loading an entire training dataset into memory (out-of-core learning), but my question is whether the classifier can keep calling partial_fit after it has already been trained on an initial set.
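For reference, a typical out-of-core loop looks like the sketch below. The read_batches generator is a made-up placeholder standing in for a real loader that streams labelled texts from disk, and HashingVectorizer is used because TfidfVectorizer needs to see the full corpus to build its vocabulary, so it cannot be trained batch by batch:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no vocabulary to fit, so it works batch by batch.
vectorizer = HashingVectorizer(analyzer='char_wb', ngram_range=(3, 5))
clf = SGDClassifier(loss='modified_huber', penalty='elasticnet')

# Every label the model will ever see; partial_fit requires this
# on (at least) the first call.
all_classes = ['ham', 'spam']

def read_batches():
    # Placeholder: in practice this would stream batches from disk.
    yield ['win money now', 'lunch tomorrow?'], ['spam', 'ham']
    yield ['free prize inside', 'see you at five'], ['spam', 'ham']

for texts, labels in read_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)
```

Each pass through the loop touches only one batch, so memory use stays bounded by the batch size rather than the corpus size.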

In my use case, imagine that each text my algorithm has to classify after training has 'relative' texts linked to it with extremely similar features, the only difference being misspellings. I would like these 'relative' texts to be added automatically to the classifier's knowledge under the same label as the original text, so that common misspellings which would otherwise evade the algorithm are labelled correctly as well.

In essence, I want an updateable classifier. What would be the best way to do this in Python?

Upvotes: 3

Views: 980

Answers (1)

ponadto

Reputation: 722

The way I understand your question: you have a classifier that was already pre-trained on some initial set, you would like to make predictions on new observations, and then, once the actual labels of those observations are known, add them to the training data to further improve your model.

I actually think this can be readily done just by calling partial_fit with those new observations, no strings attached (so to speak). The out-of-core example in the scikit-learn documentation seems adaptable to your purposes.

Upvotes: 1
