Diego Pesco Alcalde

Reputation: 35

Efficiently converting pandas dataframe to scipy sparse matrix

I'm trying to convert a pandas DataFrame to a scipy sparse matrix so that I can work efficiently with many features.

However, I haven't found a memory-efficient way to access the values in the DataFrame, so I always run out of memory during the conversion. I tried the two approaches below, and neither works. I've searched a lot but haven't found anything better. If anyone has a suggestion, I'd be happy to test it.

from scipy import sparse
sparse_array = sparse.csc_matrix(df.values)
sparse_array = sparse.csc_matrix(df.to_numpy())

Upvotes: 1

Views: 306

Answers (1)

CJR

Reputation: 3985

If your DataFrame is very sparse, you could convert it column by column and then stack the results, which avoids materializing the whole frame as a single dense array:

from scipy import sparse

sparse_array = sparse.hstack([sparse.csc_matrix(df[i].values.reshape(-1, 1)) for i in df.columns])
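
To make the column-wise idea concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration). It asks hstack for CSC output directly via its format argument:

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy frame: mostly zeros, two non-zero entries.
df = pd.DataFrame({"a": [0, 0, 3, 0], "b": [0, 1, 0, 0], "c": [0, 0, 0, 0]})

# Convert each column on its own, so no dense 2D array of the
# whole frame is ever built, then stack the pieces side by side.
columns = [
    sparse.csc_matrix(df[name].to_numpy().reshape(-1, 1))
    for name in df.columns
]
stacked = sparse.hstack(columns, format="csc")

print(stacked.shape)  # (4, 3)
print(stacked.nnz)    # 2
```

Only the non-zero entries are stored, so the all-zero column "c" costs almost nothing.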

But it is probably best to just convert it to a sparse DataFrame:

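import pandas as pd

for i in df.columns:
    # fill_value=0 makes zeros the implicit entries (float columns default to NaN)
    df[i] = df[i].astype(pd.SparseDtype(df[i].dtype, fill_value=0))

sparse_array = sparse.csc_matrix(df.sparse.to_coo())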

(Note that there may be an issue if the dtypes are not homogeneous across the DataFrame's columns.)
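
A minimal sketch of the sparse-DataFrame route on a hypothetical frame with one int and one float column (names and values are made up). It passes fill_value=0 explicitly; as far as I know, float columns otherwise use NaN as the fill value, and to_coo expects the fill value to be zero. Treat that as an assumption to check against your pandas version.

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy frame with mixed dtypes (int64 and float64).
df = pd.DataFrame({"count": [0, 2, 0, 0], "weight": [0.0, 0.0, 1.5, 0.0]})

# Work on a copy so the original dense frame is left untouched.
sparse_df = df.copy()
for name in sparse_df.columns:
    # Zeros become the implicit (unstored) entries of each column.
    sparse_df[name] = sparse_df[name].astype(
        pd.SparseDtype(sparse_df[name].dtype, fill_value=0)
    )

# to_coo returns a scipy COO matrix; convert it to CSC afterwards.
sparse_array = sparse_df.sparse.to_coo().tocsc()

print(sparse_array.shape)  # (4, 2)
print(sparse_array.nnz)    # 2
```

With mixed column dtypes the resulting matrix has a single dtype, so it is worth checking sparse_array.dtype if the distinction between ints and floats matters to you.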

Upvotes: 1
