shantanu sinha

Reputation: 11

Getting a memory error while transforming a sparse matrix to an array with column names; this array is the input to a training model

My training data consists of 5 million rows of product descriptions with an average length of 10 words. I can use either CountVectorizer or TF-IDF to transform my input feature. However, after transforming the feature to a sparse matrix, I constantly get a memory error when converting it to an array or dense array. The CountVectorizer returns ~130k token columns. Below are the two methods I am trying to implement. Please note, the system I am working on has 512 GB of memory. Below is the error:

return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
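For context, a back-of-envelope calculation (assuming 8-byte int64/float64 cells, which is what toarray() produces here) shows why this allocation fails even with 512 GB of RAM:

rows = 5_000_000        # training rows
cols = 130_000          # token columns from CountVectorizer
bytes_per_cell = 8      # int64 counts / float64 TF-IDF weights

dense_bytes = rows * cols * bytes_per_cell
print(f"{dense_bytes / 1e12:.1f} TB")   # ~5.2 TB, an order of magnitude beyond 512 GB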

Method 1

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams; drop terms appearing in fewer than 20 documents
vect1 = CountVectorizer(ngram_range=(1, 2), min_df=20)

# fit_transform learns the vocabulary and returns a sparse document-term matrix
train_dtm1 = vect1.fit_transform(train_data)

# MemoryError is raised here: toarray() materializes the full dense matrix
dtm_data = pd.DataFrame(train_dtm1.toarray(), columns=vect1.get_feature_names())
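One way to sidestep the toarray() call entirely is to keep the CSR matrix as-is; scikit-learn's train_test_split accepts scipy sparse matrices directly. A minimal sketch, with labels standing in for the actual target column:

from sklearn.model_selection import train_test_split

# train_dtm1 stays sparse end to end; no dense copy is ever made
X_train, X_test, y_train, y_test = train_test_split(
    train_dtm1, labels, test_size=0.2, random_state=42
)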

Method 2

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# English stop words removed; terms in >50% of documents or fewer than 20 are dropped
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

corpus = tfidf.fit_transform(train_data)

# Same failure point: todense() tries to allocate the full dense matrix
dtm_data = pd.DataFrame(corpus.todense(), columns=tfidf.get_feature_names())

dtm_data then goes into a train-test split, which in turn feeds a Keras ANN. How can I resolve this memory issue?
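For the Keras stage, one option (a sketch under assumptions, not tested on this data) is to densify one batch at a time with a generator, so the full matrix never exists in memory; batch_size and the label array y are illustrative placeholders:

import numpy as np

def sparse_batch_generator(X, y, batch_size=256):
    # X is a scipy.sparse CSR matrix, y a NumPy array of labels.
    # Only batch_size rows are densified at any moment.
    n = X.shape[0]
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            rows = idx[start:start + batch_size]
            yield X[rows].toarray(), y[rows]

# model.fit(sparse_batch_generator(X_train, y_train),
#           steps_per_epoch=X_train.shape[0] // 256, epochs=5)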

Upvotes: 1

Views: 230

Answers (1)

KetZoomer

Reputation: 2915

An out-of-memory error happens when Python tries to use more memory than is available. Along with your system memory, check your graphics card's memory if you are using tensorflow-gpu, since the GPU has its own, much smaller pool. You might also want to take a look at Google Colab, which runs your Python program in the cloud.
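As a quick check before the conversion, you can see how much RAM is actually free and whether TensorFlow sees a GPU (this assumes the psutil package is installed):

import psutil
import tensorflow as tf

# System RAM currently available to the process
print(f"Available RAM: {psutil.virtual_memory().available / 1e9:.1f} GB")

# GPUs visible to tensorflow-gpu, each with its own memory pool
print(tf.config.list_physical_devices('GPU'))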

Upvotes: 0
