Reputation: 11173
I'm classifying text data using TfidfVectorizer from scikit-learn.
I transform my dataset and it turns into a 75k by 172k sparse matrix with 865k stored elements (I used an n-gram range of (1, 3)).
Fitting takes a long time, but it does complete.
However, when I try to predict on the test set I get memory errors. Why is this? I would have thought that the most memory-intensive part would be fitting, not predicting.
I've tried a few things to circumvent this, but have had no luck. First I tried dumping the data locally with joblib.dump, quitting Python, and restarting. Unfortunately, that didn't work.
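For reference, the dump/restart attempt looked roughly like this (the filename is just a placeholder; in 0.15 joblib ships inside scikit-learn):

from sklearn.externals import joblib  # bundled with scikit-learn 0.15

joblib.dump(xtfidf, 'xtfidf.pkl')   # persist the transformed training matrix
# ...quit and restart Python...
xtfidf = joblib.load('xtfidf.pkl')  # reload before fitting/predicting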
Then I tried switching over to a HashingVectorizer, but ironically the hashing vectorizer causes memory issues on the same dataset. I was under the impression that a HashingVectorizer would be more memory efficient?
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

hashing = HashingVectorizer(analyzer='word', ngram_range=(1, 3))
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3))

xhash = hashing.fit_transform(x)   # x is the raw text training data
xtfidf = tfidf.fit_transform(x)

pac = PassiveAggressiveClassifier()
pac.fit(xhash, y)    # causes MemoryError
pac.fit(xtfidf, y)   # works fine
I am using scikit-learn 0.15 (bleeding edge) and Windows 8.
I have 8 GB of RAM and a hard drive with 100 GB of free space. I set my virtual memory to 50 GB for the purposes of this project, and I can set it even higher if needed, but I'm trying to understand the problem rather than just brute-forcing solutions like I have been for the past couple of days. I've tried a few different classifiers: mostly PassiveAggressiveClassifier, Perceptron, MultinomialNB, and LinearSVC, swapped in as sketched below.
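Each classifier was tried the same way, roughly like this (xtfidf_test here is just a stand-in name for the transformed test set):

from sklearn.linear_model import PassiveAggressiveClassifier, Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

for clf in (PassiveAggressiveClassifier(), Perceptron(),
            MultinomialNB(), LinearSVC()):
    clf.fit(xtfidf, y)        # fitting succeeds, if slowly
    clf.predict(xtfidf_test)  # predicting is where the MemoryError hits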
I should also note that at one point I was using a 350k by 472k sparse matrix with 12M stored elements. I was still able to fit the data, despite it taking some time, but I had memory errors when predicting.
Upvotes: 1
Views: 1534
Reputation: 1733
The scikit-learn library is heavily optimized (it builds on NumPy and SciPy). TfidfVectorizer produces sparse matrices, which are relatively small compared with standard dense matrices.
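To put that in perspective for the 75k by 172k matrix above, a back-of-the-envelope comparison (assuming float64 values and CSR storage with int32 indices):

dense_bytes = 75000 * 172000 * 8          # ~1.0e11 bytes, on the order of 100 GB
csr_bytes = 865000 * (8 + 4) + 75001 * 4  # ~1.1e7 bytes, roughly 11 MB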
If you think it is a memory issue, you can set the max_features parameter when you create the TfidfVectorizer. It may be useful for checking your assumptions (for more detail about TfidfVectorizer, see the documentation).
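A minimal sketch, reusing the setup from the question (the 50000 cap is an arbitrary value to experiment with, not a recommendation):

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 50,000 most frequent terms across the corpus.
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), max_features=50000)
xtfidf = tfidf.fit_transform(x)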
Also, I can recommend that you reduce the training set and check again; that can also be useful for checking your assumptions.
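For example (assuming x and y are indexable, as in the question; the 10k size is arbitrary):

# Quick check on a 10k-document slice of the training data.
x_small, y_small = x[:10000], y[:10000]
xtfidf_small = tfidf.fit_transform(x_small)
pac.fit(xtfidf_small, y_small)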
Upvotes: 1