Reputation: 11173
I'm classifying text data using TfidfVectorizer from scikit-learn.
I transform my dataset and it turns into a 75k by 172k sparse matrix with 865k stored elements (I used an n-gram range of (1, 3)).
Fitting takes a long time, but it does complete.
However, when I try to predict on the test set I get memory errors. Why is this? I would have thought that the most memory-intensive part would be fitting, not predicting.
I've tried a few things to circumvent this, but have had no luck. First I tried dumping the data locally with joblib.dump, quitting Python, and restarting. Unfortunately, that didn't work.
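For reference, the dump/restart attempt looked roughly like this (the filename is just a placeholder; in 0.15 joblib ships inside scikit-learn):

from sklearn.externals import joblib  # bundled with scikit-learn 0.15

joblib.dump(xtfidf, 'xtfidf.pkl')   # persist the transformed training matrix
# ...quit and restart Python...
xtfidf = joblib.load('xtfidf.pkl')  # reload before fitting/predicting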
Then I tried switching over to a HashingVectorizer, but ironically the hashing vectorizer causes memory issues on the same dataset. I was under the impression that a HashingVectorizer would be more memory efficient?
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

hashing = HashingVectorizer(analyzer='word', ngram_range=(1, 3))
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3))

xhash = hashing.fit_transform(x)   # x is the raw text training data
xtfidf = tfidf.fit_transform(x)

pac = PassiveAggressiveClassifier()
pac.fit(xhash, y)    # causes MemoryError
pac.fit(xtfidf, y)   # works fine
I am using scikit-learn 0.15 (bleeding edge) and Windows 8.
I have 8 GB of RAM and a hard drive with 100 GB of free space. I set my virtual memory to 50 GB for the purposes of this project, and I can set it even higher if needed, but I'm trying to understand the problem rather than just brute-forcing solutions like I have been for the past couple of days. I've tried a few different classifiers: mostly PassiveAggressiveClassifier, Perceptron, MultinomialNB, and LinearSVC, swapped in as sketched below.
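Each classifier was tried the same way, roughly like this (xtfidf_test here is just a stand-in name for the transformed test set):

from sklearn.linear_model import PassiveAggressiveClassifier, Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

for clf in (PassiveAggressiveClassifier(), Perceptron(),
            MultinomialNB(), LinearSVC()):
    clf.fit(xtfidf, y)        # fitting succeeds, if slowly
    clf.predict(xtfidf_test)  # predicting is where the MemoryError hits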
I should also note that at one point I was using a 350k by 472k sparse matrix with 12M stored elements. I was still able to fit the data, despite it taking some time, but I had memory errors when predicting.
Upvotes: 1
Views: 1534
Reputation: 1733
The scikit-learn library is heavily optimized (it builds on NumPy and SciPy). TfidfVectorizer produces sparse matrices, which are relatively small compared with standard dense matrices.
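To put that in perspective for the 75k by 172k matrix above, a back-of-the-envelope comparison (assuming float64 values and CSR storage with int32 indices):

dense_bytes = 75000 * 172000 * 8          # ~1.0e11 bytes, on the order of 100 GB
csr_bytes = 865000 * (8 + 4) + 75001 * 4  # ~1.1e7 bytes, roughly 11 MB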
If you think it is a memory issue, you can set the max_features parameter when you create the TfidfVectorizer. It may be useful for checking your assumptions (for more detail about TfidfVectorizer, see the documentation).
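A minimal sketch, reusing the setup from the question (the 50000 cap is an arbitrary value to experiment with, not a recommendation):

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 50,000 most frequent terms across the corpus.
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), max_features=50000)
xtfidf = tfidf.fit_transform(x)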
Also, I can recommend that you reduce the training set and check again; that can also be useful for checking your assumptions.
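For example (assuming x and y are indexable, as in the question; the 10k size is arbitrary):

# Quick check on a 10k-document slice of the training data.
x_small, y_small = x[:10000], y[:10000]
xtfidf_small = tfidf.fit_transform(x_small)
pac.fit(xtfidf_small, y_small)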
Upvotes: 1