yearntolearn

Reputation: 1074

Vectorization in sklearn seems to be very memory expensive. Why?

I need to process more than 1,000,000 text records. I am employing CountVectorizer to transform my data. I have the following code.

TEXT = [data[i].values()[3] for i in range(len(data))] #these are the text records

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)


X_list = X.toarray().tolist()

When I run this code, it raises a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very memory expensive.

UPDATE

I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003, max_df=3.05, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()

X_tolist = X.toarray().tolist()

Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Why is this, and how can I get around it? Thank you!!

Upvotes: 1

Views: 281

Answers (1)

bpachev

Reputation: 2212

Your problem is that X is a sparse matrix with one row per document, recording which words appear in that document (and how often). With a million documents and 2391 distinct words in total (the length of feature_names given in your question), the dense version of X would have roughly 2.4 billion entries, which is easily enough to cause a MemoryError.
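
To put rough numbers on that (a back-of-the-envelope sketch, assuming CountVectorizer's default 8-byte integer dtype; the counts mirror those in the question):

n_docs = 1000000       # number of text records
n_features = 2391      # len(feature_names) reported in the question

dense_entries = n_docs * n_features   # ~2.4 billion entries
dense_bytes = dense_entries * 8       # ~19 GB for the dense array alone
print("%d entries, ~%.0f GB" % (dense_entries, dense_bytes / 1e9))

And the .tolist() call on top of that then builds a Python int object for every entry, which is larger still.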

The offending line is X_list = X.toarray().tolist(), which converts X to a dense array and then to nested Python lists. You don't have enough memory for that, and there should be a way to do whatever you are trying to do without it, since the sparse version of X already holds all the information you need.
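
For example, most common operations work on the sparse matrix directly (a minimal sketch, reusing X and feature_names from your code; adapt it to whatever you actually need downstream):

import numpy as np

# Per-document total word counts, computed without densifying X
doc_lengths = np.asarray(X.sum(axis=1)).ravel()

# Words present in a single document, e.g. the first one
row = X.getrow(0)                                  # still sparse, 1 x n_features
present = [feature_names[j] for j in row.nonzero()[1]]

# If you really need Python lists, convert one row at a time
first_row_counts = row.toarray().ravel().tolist()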

Upvotes: 2
