Reputation: 5407
I am working on a two-class text classification task. Usually I create pickle files of the trained model and load them later to avoid retraining.
With 12,000 reviews plus more than 50,000 tweets for each class, the trained model grows to 1.4 GB.
Storing such a large model in a pickle and loading it back is neither feasible nor advisable.
Is there a better alternative for this scenario?
Here is the sample code. I tried multiple ways of pickling; here I have used the dill package:
def train(self):
    global pos, neg, totals
    retrain = False
    # Load the counts if they already exist.
    if not retrain and os.path.isfile(CDATA_FILE):
        # pos, neg, totals = cPickle.load(open(CDATA_FILE))
        pos, neg, totals = dill.load(open(CDATA_FILE, 'rb'))  # pickle data must be read in binary mode
        return
    # Count word features for the negative ("unsuspected") class.
    for file in os.listdir("./unsuspected/"):
        for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    # Count word features for the positive ("suspected") class.
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./suspected/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    self.prune_features()
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    countdata = (pos, neg, totals)
    dill.dump(countdata, open(CDATA_FILE, 'wb'))  # binary mode here as well
UPDATE: The reason the pickle is so large is that the classification data itself is very large, and I use 1-4 grams for feature selection. The classification dataset is around 300 MB, so the multigram approach to feature selection produces a very large trained model.
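As a rough illustration of how quickly 1-4 gram features accumulate (a sketch only, not the actual negate_sequence feature extractor used above):

# Illustrative only: every token contributes up to four n-gram features.
def ngrams(tokens, n_max=4):
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(' '.join(tokens[i:i + n]))
    return feats

tokens = "this product is not worth the price".split()
print(len(tokens), len(ngrams(tokens)))  # 7 tokens -> 22 n-gram features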
Upvotes: 0
Views: 540
Reputation: 485
The problem and its solution are explained here. In short, when doing featurization, e.g. with CountVectorizer, even if you ask for a small number of features (e.g. max_features=1000), the transformer still keeps a copy of all possible features under the hood for debugging purposes.
For instance, CountVectorizer has the following attribute:
stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
and this causes the model size to become too large. To solve the issue, set stop_words_ to None before pickling your model (taken from the example in the link above; see that link for details):
import pickle

model_name = 'clickbait-model-sm.pkl'
# Drop the debugging-only stop_words_ attribute before serializing the pipeline.
cfr_pipeline.named_steps.vectorizer.stop_words_ = None
pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)
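As a rough illustration of the effect (a minimal sketch, not from the linked article; the toy corpus and parameters are made up), you can compare the pickled size of a fitted CountVectorizer before and after clearing stop_words_:

import pickle
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was great",
        "terrible movie , not great at all",
        "great acting , weak plot"]
vec = CountVectorizer(ngram_range=(1, 4), max_features=10).fit(docs)

size_before = len(pickle.dumps(vec, protocol=2))
vec.stop_words_ = None  # the attribute is only kept for introspection/debugging
size_after = len(pickle.dumps(vec, protocol=2))
print(size_before, size_after)  # the difference becomes dramatic on large corpora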
Upvotes: 0
Reputation: 9112
Pickle is a very heavy format: it stores all the details of the Python objects. It would be much better to store your data in an efficient format like HDF5. If you are not familiar with HDF5, you can store your data in simple flat text files instead, using CSV or JSON depending on your data structure; you'll find that either is more efficient than pickle.
You can also look at gzip to create and load compressed archives.
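For the counts in the question, a compressed JSON flat file is one concrete way to combine these two suggestions. This is only a sketch: save_counts and load_counts are hypothetical helper names, and it assumes pos and neg are plain dicts of strings to integers and totals is indexable by 0 and 1, as in the question's code.

import gzip
import json

def save_counts(path, pos, neg, totals):
    # gzip.open in text mode plus json gives a compact, compressed flat file
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        json.dump({'pos': pos, 'neg': neg,
                   'totals': [totals[0], totals[1]]}, f)

def load_counts(path):
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        data = json.load(f)
    return data['pos'], data['neg'], data['totals']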
Upvotes: 1