Reputation: 11
My training data consists of 5 million rows of product descriptions with an average length of 10 words. I can use either CountVectorizer or TfidfVectorizer to transform the input feature. However, after transforming the feature into a sparse matrix, converting it to a dense array consistently raises a MemoryError. CountVectorizer returns ~130k token columns. Below are the two methods I am trying to implement. Please note that the machine I am working on has 512 GB of memory. Below is the error:
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
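Not in the original post, but a rough back-of-the-envelope check (assuming the ~130k features reported above and the default 8-byte int64/float64 values) shows why the dense conversion cannot fit even in 512 GB:

# Rough size estimate for the dense document-term matrix (figures assumed from the question)
rows = 5_000_000          # product descriptions
cols = 130_000            # token columns reported by CountVectorizer
bytes_per_cell = 8        # int64 counts / float64 tf-idf values
size_tib = rows * cols * bytes_per_cell / 1024**4
print(f"Dense matrix would need ~{size_tib:.1f} TiB")  # roughly 4.7 TiB, far above 512 GB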
Method 1
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect1 = CountVectorizer(ngram_range=(1, 2), min_df=20)
train_dtm1 = vect1.fit_transform(train_data['description_cleaned'])  # sparse CSR matrix
# MemoryError is raised on the next line, when the sparse matrix is densified
dtm_data = pd.DataFrame(train_dtm1.toarray(), columns=vect1.get_feature_names())
Method 2
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)
corpus = tfidf.fit_transform(train_data['description_cleaned'])  # sparse CSR matrix
dtm_data = pd.DataFrame(corpus.todense(), columns=tfidf.get_feature_names())  # same MemoryError here
dtm_data then goes into a train-test split, which in turn feeds a Keras ANN. How can I resolve this memory issue?
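For context, one common workaround (not taken from the post) is to skip the dense conversion entirely: keep the scipy sparse matrix through the train-test split and densify only one batch at a time with a keras.utils.Sequence. A minimal sketch, assuming TensorFlow 2.x, the vectorizer from Method 1, and a hypothetical binary label array called labels:

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Keep the document-term matrix sparse; train_test_split accepts scipy sparse input
X = vect1.fit_transform(train_data['description_cleaned'])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

class SparseBatchSequence(keras.utils.Sequence):
    """Yields dense batches from a scipy sparse matrix, one batch at a time."""
    def __init__(self, X, y, batch_size=1024):
        self.X, self.y, self.batch_size = X, np.asarray(y), batch_size
    def __len__(self):
        return int(np.ceil(self.X.shape[0] / self.batch_size))
    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.X[sl].toarray(), self.y[sl]   # only this slice is densified

# Placeholder ANN; the real architecture is not shown in the question
model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(SparseBatchSequence(X_train, y_train), epochs=3)

With this setup only batch_size rows are ever held as a dense array, so memory use stays bounded regardless of the total number of rows.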
Upvotes: 1
Views: 230
Reputation: 2915
An out-of-memory error happens when Python uses more memory than is available. Along with your system memory, look at your graphics card memory if you are using tensorflow-gpu. You might also want to take a look at Google Colab, which runs the Python program in the cloud.
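As a quick sketch (assuming TensorFlow 2.x), you can check which GPUs TensorFlow actually sees and stop it from reserving all GPU memory up front:

import tensorflow as tf

# An empty list means training is running on the CPU and system RAM
gpus = tf.config.list_physical_devices('GPU')
print(gpus)

# Allocate GPU memory on demand instead of grabbing it all at startup
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)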
Upvotes: 0