turnip

Reputation: 2346

Can I train my classifier multiple times?

I am building a basic NLP program using nltk and sklearn. I have a large dataset in a database and I am wondering what the best way to train the classifier is.

Is it advisable to download the training data in chunks and pass each chunk to the classifier? Is that even possible, or would I be overwriting what was learned from the previous chunk?

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

while True:
    training_set, proceed = download_chunk()  # pseudo
    trained = SklearnClassifier(MultinomialNB()).train(training_set)
    if not proceed:
        break

How is this normally done? I want to avoid keeping the database connection open for too long.

Upvotes: 2

Views: 1761

Answers (1)

gaw89

Reputation: 1068

The way you're doing it right now will just overwrite the classifier for each chunk of your training data, because you're creating a new SklearnClassifier object each time. You need to instantiate the classifier before entering the training loop. However, looking at the code here, it appears that the NLTK SklearnClassifier uses the fit method of the underlying sklearn model, which means you can't update the model once it has been trained. Instead, instantiate the sklearn model directly and use its partial_fit method. Something like this should work:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()  # must instantiate the classifier outside of the loop or it will just get overwritten

all_labels = [...]  # pseudo: every label that can appear; partial_fit needs the full set on the first call

while True:
    training_set, proceed = download_chunk()  # pseudo
    X_chunk, y_chunk = zip(*training_set)     # split the (feature vector, label) pairs
    # partial_fit takes features and labels separately; features must already be numeric
    clf.partial_fit(X_chunk, y_chunk, classes=all_labels)
    if not proceed:
        break

At the end, you'll have a MultinomialNB() classifier that has been trained on each chunk of your data.
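One detail the sketch above glosses over: MultinomialNB works on numeric feature matrices, not on the (featureset, label) pairs NLTK classifiers expect, so each chunk has to be vectorized before partial_fit. A minimal sketch of one way to do that with sklearn's stateless FeatureHasher; the fetch_chunks() generator and the label set are hypothetical stand-ins for your own data-access code:

from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

# FeatureHasher is stateless, so it can vectorize each chunk independently
# without a separate fitting pass over the whole dataset.
# alternate_sign=False keeps values non-negative, which MultinomialNB requires.
hasher = FeatureHasher(alternate_sign=False)
clf = MultinomialNB()
all_labels = ["pos", "neg"]  # hypothetical: every label that can appear

for chunk in fetch_chunks():  # hypothetical generator yielding lists of (featureset, label) pairs
    featuresets, labels = zip(*chunk)          # featuresets assumed to map feature names to numeric values
    X = hasher.transform(featuresets)          # sparse matrix of hashed feature values
    clf.partial_fit(X, labels, classes=all_labels)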

Typically, if the whole dataset will fit in memory, it is somewhat more performant to just download the whole thing and call fit once (in which case you could actually use the nltk SklearnClassifier). See the notes about the partial_fit method here. However, if you are unable to fit the entire set in memory, it is certainly common practice to train on chunks of the data. You can do this by making several calls to the database or by extracting all of the information from the database, placing it in a CSV on your hard drive, and reading chunks of it from there.
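For the CSV route, pandas can stream the file in chunks, and a stateless text vectorizer keeps each chunk independent of the others. A rough sketch, assuming a hypothetical training.csv with "text" and "label" columns:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer is stateless, so each chunk can be vectorized on its own;
# alternate_sign=False keeps feature values non-negative for MultinomialNB.
vectorizer = HashingVectorizer(alternate_sign=False)
clf = MultinomialNB()
all_labels = ["pos", "neg"]  # hypothetical: every label that can appear

# Hypothetical file layout: one row per document, columns "text" and "label".
for chunk in pd.read_csv("training.csv", chunksize=10_000):
    X = vectorizer.transform(chunk["text"])
    clf.partial_fit(X, chunk["label"], classes=all_labels)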

Note

If you're using a database shared with other users, the DBAs may prefer that you extract everything at once, since this would (probably) take up fewer DB resources than making several separate, smaller calls to the database.
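If you do go the extract-once route, a single query streamed to disk keeps the connection short-lived and leaves you a file you can re-read in chunks as many times as you like. A sketch using the standard library's sqlite3 and csv modules; the database file, table, and column names are hypothetical:

import csv
import sqlite3

# Pull everything with one query, spooling rows to disk in batches so the
# whole result set never has to sit in memory at once.
conn = sqlite3.connect("training.db")  # hypothetical database file
cursor = conn.execute("SELECT text, label FROM training_data")  # hypothetical table/columns

with open("training.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    while True:
        rows = cursor.fetchmany(10_000)
        if not rows:
            break
        writer.writerows(rows)

conn.close()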

Upvotes: 7
