Reputation: 1074
I need to process more than 1,000,000 text records. I am using CountVectorizer to transform my data, with the following code.
TEXT = [data[i].values()[3] for i in range(len(data))] #these are the text records
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)
X_list = X.toarray().tolist()
When I run this code, I get a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very expensive.
UPDATE
I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003, max_df=3.05, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()
X_tolist = X.toarray().tolist()
Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Why is this happening, and how can I get around it? Thank you!
Upvotes: 1
Views: 281
Reputation: 2212
Your problem is that X is a sparse matrix with one row for each document, representing which words are present in that document. If you have a million documents with a total of 2391 distinct words among them (the length of feature_names as provided in your question), the total number of entries in the dense version of X would be nearly 2.4 billion, enough to cause a memory error.
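You can estimate the memory cost of the dense array yourself. A quick back-of-the-envelope sketch, assuming CountVectorizer's default int64 dtype (8 bytes per entry) and the row/column counts from your question:

```python
n_docs = 1000000      # number of documents
n_features = 2391     # len(feature_names) from the question
bytes_per_entry = 8   # CountVectorizer's default dtype is numpy.int64

dense_bytes = n_docs * n_features * bytes_per_entry
print("%.1f GB" % (dense_bytes / 1e9))  # roughly 19 GB for the dense array alone
```

And that is before .tolist() converts every entry into a full Python int object, which costs several times more again.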
The problem is with this line:
X_list = X.toarray().tolist()
which converts X to a dense array. You don't have enough memory for that, and there should be a way to do whatever you are trying to do without it, since the sparse version of X has all the information you need.
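A minimal sketch of working with the sparse matrix directly, using a toy corpus in place of your TEXT list (the corpus here is made up for illustration). Most operations people reach for toarray() to do, such as sums, per-row access, and iteration, work on the sparse matrix as-is:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for your TEXT list of ~1,000,000 short paragraphs
TEXT = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)  # X is a scipy.sparse CSR matrix

print(X.shape)        # (n_documents, n_features)
print(X.sum(axis=0))  # total count of each word across all documents

# If you must inspect a row, densify ONE row at a time, never the whole matrix
print(X[0].toarray())

# Iterate row by row without ever building the full dense array
for i in range(X.shape[0]):
    row = X.getrow(i)  # still sparse, so this is cheap
    # row.indices holds the feature ids present in document i,
    # row.data holds their corresponding counts
```

If whatever consumes X_list downstream truly needs a list, process the rows one at a time as above rather than materializing all million rows at once.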
Upvotes: 2