GreenGodot

Reputation: 6773

Out-Of-Core learning for Sklearn pipelines

I'm a newbie doing some work in scikit-learn with SGDClassifier, classifying one-sentence texts into labels (think ham/spam emails, for example). Here is my pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# stopset is my own set of stop words
# (note: n_iter was renamed max_iter in newer scikit-learn releases)
clf = SGDClassifier(fit_intercept=True, loss='modified_huber', alpha=.0001, shuffle=True,
                    n_iter=15, n_jobs=-1, penalty='elasticnet')
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), lowercase=True,
                             stop_words=stopset, use_idf=True, norm='l2')
pipeline = Pipeline([
    ('mapper', vectorizer),
    ('clf', clf),
])

I am familiar with using partial_fit to avoid loading an entire training dataset into memory (out-of-core learning), but my question is whether the classifier can keep calling partial_fit after it has already been trained on an initial set.
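For reference, a typical out-of-core loop looks like the sketch below. The read_batches generator is a made-up placeholder standing in for a real loader that streams labelled texts from disk, and HashingVectorizer is used because TfidfVectorizer needs to see the full corpus to build its vocabulary, so it cannot be trained batch by batch:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no vocabulary to fit, so it works batch by batch.
vectorizer = HashingVectorizer(analyzer='char_wb', ngram_range=(3, 5))
clf = SGDClassifier(loss='modified_huber', penalty='elasticnet')

# Every label the model will ever see; partial_fit requires this
# on (at least) the first call.
all_classes = ['ham', 'spam']

def read_batches():
    # Placeholder: in practice this would stream batches from disk.
    yield ['win money now', 'lunch tomorrow?'], ['spam', 'ham']
    yield ['free prize inside', 'see you at five'], ['spam', 'ham']

for texts, labels in read_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)
```

Each pass through the loop touches only one batch, so memory use stays bounded by the batch size rather than the corpus size.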

In my use case, imagine that each text my algorithm has to classify after training has 'relative' texts linked to it with extremely similar features, the only difference being misspellings. I would like these 'relative' texts to be added automatically to the classifier's knowledge under the same label as the original text, so that common misspellings which would otherwise evade the algorithm are labelled correctly as well.

In essence, I want an updateable classifier. What would be the best way to do this in Python?

Upvotes: 3

Views: 980

Answers (1)

ponadto

Reputation: 722

The way I understand your question: you have a classifier that was already pre-trained on some initial set, you would like to make predictions on new observations, and then, once the actual labels of those observations are known, add them to the training data to further improve your model.

I actually think this can be readily done just by calling partial_fit with those new observations, no strings attached (so to speak). The out-of-core example in the scikit-learn documentation seems adaptable to your purposes.

Upvotes: 1
