Bharat Sharma

Reputation: 1219

Memory issues for sparse one-hot encoded features

I want to create a sparse matrix of one-hot encoded features from a DataFrame df, but the code below runs out of memory. The shape of sparse_onehot is (450138, 1508).

import pandas as pd
import scipy.sparse

sp_features = ['id', 'video_id', 'genre']
sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features)
X = scipy.sparse.csr_matrix(sparse_onehot.values)

I get the memory error shown below.

MemoryError: Unable to allocate 647. MiB for an array with shape (1508, 450138) and data type uint8

I have also tried scipy.sparse.lil_matrix and get the same error.

Is there an efficient way to handle this? Thanks in advance.

Upvotes: 2

Views: 152

Answers (1)

Ami Tavory

Reputation: 76316

Try setting the sparse parameter of pd.get_dummies to True. From the documentation:

sparse : bool, default False
    Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features, sparse = True)

This stores only the nonzero entries of each dummy column, so it uses far less memory than the default dense representation, at the cost of somewhat slower operations.
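Note that calling .values on the result would turn it back into a dense NumPy array, which brings the memory problem back. A minimal sketch of one way to get a SciPy matrix without densifying, assuming a pandas version that provides the DataFrame.sparse accessor and that every column of the result is sparse (the toy df below is hypothetical and only stands in for the real data; dtype=np.uint8 is an illustrative choice, not something from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the asker's df.
df = pd.DataFrame({
    "id": [1, 2, 1, 3],
    "video_id": [10, 10, 20, 30],
    "genre": ["a", "b", "a", "c"],
})
sp_features = ["id", "video_id", "genre"]

# Build the dummies as sparse columns so no dense block is allocated.
sparse_onehot = pd.get_dummies(
    df[sp_features], columns=sp_features, sparse=True, dtype=np.uint8
)

# Convert directly to a SciPy COO matrix, then to CSR, without going
# through .values (which would materialize a dense array).
X = sparse_onehot.sparse.to_coo().tocsr()

print(X.shape, X.nnz)
```

Each row has exactly one nonzero entry per encoded feature, so the number of stored values grows with the number of rows rather than with rows times columns.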

Upvotes: 1
