Reputation: 35
I'm trying to convert a pandas DataFrame to a SciPy sparse matrix so that I can work efficiently with many features.
However, I haven't found a memory-efficient way to access the values in the DataFrame, so I always run out of memory during the conversion. I tried the two approaches below and neither works. I've researched a lot but haven't found anything better. If anyone has a suggestion, I'd be happy to test it.
from scipy import sparse
sparse_array = sparse.csc_matrix(df.values)
sparse_array = sparse.csc_matrix(df.to_numpy())
Upvotes: 1
Views: 306
Reputation: 3985
If your DataFrame is very sparse, you could convert it column by column and then stack the results, so that only one dense column is materialized at a time:
from scipy import sparse
sparse_array = sparse.hstack([sparse.csc_matrix(df[i].values.reshape(-1, 1)) for i in df.columns])
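As a rough illustration of the column-wise approach, here is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical). It assumes all columns are numeric, and it passes `format="csc"` to `hstack` so the result comes back directly in CSC form.

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy frame; a real one would be much larger and mostly zeros.
df = pd.DataFrame({"a": [0, 1, 0, 0], "b": [0, 0, 2, 0], "c": [3, 0, 0, 0]})

# Convert one column at a time into an (n_rows x 1) sparse matrix.
columns = [
    sparse.csc_matrix(df[name].to_numpy().reshape(-1, 1))
    for name in df.columns
]

# Stack the single-column matrices side by side into one CSC matrix.
sparse_array = sparse.hstack(columns, format="csc")

print(sparse_array.shape)
print(sparse_array.nnz)
```

Only the non-zero entries are stored in the result, so the peak memory cost is roughly one dense column plus the sparse pieces built so far.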
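But probably the best option is to turn it into a sparse DataFrame first:
import pandas as pd
for i in df.columns:
    # fill_value=0 makes zeros the omitted value; the default for float columns is NaN
    df[i] = df[i].astype(pd.SparseDtype(df[i].dtype, fill_value=0))
sparse_array = sparse.csc_matrix(df.sparse.to_coo())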
(Note that there may be an issue if your dtypes are not homogeneous throughout the DataFrame.)
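One possible way around mixed dtypes is to cast the whole frame to a single sparse dtype before calling `to_coo()`. The sketch below uses a made-up two-column frame and assumes that converting every column to `float64` is acceptable for your data. The fill value is set to zero explicitly so that zeros are the entries left out of the sparse representation.

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy frame with one integer and one float column.
df = pd.DataFrame({"a": [0, 1, 0, 0], "b": [0.0, 0.0, 2.5, 0.0]})

# Assumption: a common float64 dtype is fine for all columns.
# fill_value=0.0 means zeros are the values that are not stored.
sparse_df = df.astype(pd.SparseDtype("float64", fill_value=0.0))

# to_coo() builds a SciPy COO matrix, which is then converted to CSC.
sparse_array = sparse.csc_matrix(sparse_df.sparse.to_coo())

print(sparse_array.shape)
print(sparse_array.nnz)
```

Note that this still starts from a dense DataFrame, so it helps most when the frame itself fits in memory but the dense NumPy copy made by `df.values` does not.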
Upvotes: 1