Reputation: 11
My training data consists of 5 million rows of product descriptions with an average length of 10 words. I can use either CountVectorizer or TfidfVectorizer to transform the input feature. However, after transforming the feature into a sparse matrix, converting it to a dense array consistently raises a MemoryError. CountVectorizer returns ~130k token columns. Below are the two methods I am trying to implement. Please note that the machine I am working on has 512 GB of memory. Below is the error:
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
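Not in the original post, but a rough back-of-the-envelope check (assuming the ~130k features reported above and the default 8-byte int64/float64 values) shows why the dense conversion cannot fit even in 512 GB:

# Rough size estimate for the dense document-term matrix (figures assumed from the question)
rows = 5_000_000          # product descriptions
cols = 130_000            # token columns reported by CountVectorizer
bytes_per_cell = 8        # int64 counts / float64 tf-idf values
size_tib = rows * cols * bytes_per_cell / 1024**4
print(f"Dense matrix would need ~{size_tib:.1f} TiB")  # roughly 4.7 TiB, far above 512 GB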
Method 1
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect1 = CountVectorizer(ngram_range=(1, 2), min_df=20)
train_dtm1 = vect1.fit_transform(train_data['description_cleaned'])  # sparse CSR matrix
# MemoryError is raised on the next line, when the sparse matrix is densified
dtm_data = pd.DataFrame(train_dtm1.toarray(), columns=vect1.get_feature_names())
Method 2
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)
corpus = tfidf.fit_transform(train_data['description_cleaned'])  # sparse CSR matrix
dtm_data = pd.DataFrame(corpus.todense(), columns=tfidf.get_feature_names())  # same MemoryError here
dtm_data then goes into a train-test split, which in turn feeds a Keras ANN. How can I resolve this memory issue?
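For context, one common workaround (not taken from the post) is to skip the dense conversion entirely: keep the scipy sparse matrix through the train-test split and densify only one batch at a time with a keras.utils.Sequence. A minimal sketch, assuming TensorFlow 2.x, the vectorizer from Method 1, and a hypothetical binary label array called labels:

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Keep the document-term matrix sparse; train_test_split accepts scipy sparse input
X = vect1.fit_transform(train_data['description_cleaned'])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

class SparseBatchSequence(keras.utils.Sequence):
    """Yields dense batches from a scipy sparse matrix, one batch at a time."""
    def __init__(self, X, y, batch_size=1024):
        self.X, self.y, self.batch_size = X, np.asarray(y), batch_size
    def __len__(self):
        return int(np.ceil(self.X.shape[0] / self.batch_size))
    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.X[sl].toarray(), self.y[sl]   # only this slice is densified

# Placeholder ANN; the real architecture is not shown in the question
model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(SparseBatchSequence(X_train, y_train), epochs=3)

With this setup only batch_size rows are ever held as a dense array, so memory use stays bounded regardless of the total number of rows.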
Upvotes: 1
Views: 230
Reputation: 2915
An out-of-memory error happens when Python uses more memory than is available. Along with your system memory, look at your graphics card memory if you are using tensorflow-gpu. You might also want to take a look at Google Colab, which runs the Python program in the cloud.
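As a quick sketch (assuming TensorFlow 2.x), you can check which GPUs TensorFlow actually sees and stop it from reserving all GPU memory up front:

import tensorflow as tf

# An empty list means training is running on the CPU and system RAM
gpus = tf.config.list_physical_devices('GPU')
print(gpus)

# Allocate GPU memory on demand instead of grabbing it all at startup
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)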
Upvotes: 0