learner

Reputation: 877

Convert pandas single column to Scipy Sparse Matrix

I have a pandas data frame like this:

     a                           other-columns
   0.3 0.2 0.0 0.0 0.0...        ....

I want to convert column a into a SciPy sparse CSR matrix. a is a probability distribution. I would like to convert it without expanding a into multiple columns.

Here is a naive solution that does expand a into multiple columns:

  df = df.join(df['a'].str.split(expand=True).add_prefix('a')).drop(['a'], axis=1)
  df_matrix = scipy.sparse.csr_matrix(df.values)

But I don't want to expand into multiple columns, since that blows up memory usage. Is it possible to do this while keeping a in a single column?

EDIT (Minimum Reproducible Example):

 import pandas as pd
 from scipy.sparse import csr_matrix
 d = {'a': ['0.05 0.0', '0.2 0.0']}
 df = pd.DataFrame(data=d)
 df = df.join(df['a'].str.split(expand=True).add_prefix('a')).drop(['a'], axis=1)
 df = df.astype(float)
 df_matrix = csr_matrix(df.values)
 df_matrix

Output:

 <2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>

I want to achieve the above, but without splitting into multiple columns. Also, in my real file, the a column holds space-separated strings (every row contains exactly 36 spaces) and there are millions of rows.

Upvotes: 0

Views: 927

Answers (2)

CJR

Reputation: 3985

Also, in my real file, I have 36 length string (separated by space) columns and millions of rows. It is sure that all rows will contain 36 spaces.

Convert large csv to sparse matrix for use in sklearn

I cannot overstate how much you should not do the thing that follows this sentence.

import pandas as pd
import numpy as np
from scipy import sparse

df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0'] * 100000})
chunksize = 10000

sparse_coo = []
for i in range(int(np.ceil(df.shape[0] / chunksize))):
    chunk = df.iloc[i * chunksize:min((i + 1) * chunksize, df.shape[0]), :]
    sparse_coo.append(sparse.coo_matrix(chunk['a'].apply(lambda x: [float(y) for y in x.split()]).tolist()))

sparse_coo = sparse.vstack(sparse_coo)
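Since the question asked for CSR specifically, the stacked COO result can be converted once at the end with `tocsr()`. A minimal sketch with toy data standing in for the chunked pieces (not part of the original answer):

```python
import numpy as np
from scipy import sparse

# Two per-chunk COO pieces, as produced inside the loop above.
pieces = [sparse.coo_matrix(np.array([[0.05, 0.0]])),
          sparse.coo_matrix(np.array([[0.2, 0.0]]))]

# Stack all pieces, then convert once to CSR at the very end;
# converting once is cheaper than converting each chunk.
m = sparse.vstack(pieces).tocsr()
print(m.shape, m.format)  # (2, 2) csr
```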

Upvotes: 1

hpaulj

Reputation: 231738

You could get the dense array from the column without the expand:

In [179]: df = pd.DataFrame(data=d)                                                                  

e.g.

In [180]: np.array(df['a'].str.split().tolist(),float)                                               
Out[180]: 
array([[0.05, 0.  ],
       [0.2 , 0.  ]])

But I doubt that saves much memory (though I only have a crude understanding of DataFrame memory use).

You could convert each string to a sparse matrix:

In [190]: def foo(astr): 
     ...:     alist = astr.split() 
     ...:     arr = np.array(alist, float) 
     ...:     return sparse.coo_matrix(arr) 
                                                                                               
In [191]: alist = [foo(row) for row in df['a']]                                                      
In [192]: alist                                                                                      
Out[192]: 
[<1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>,
 <1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>]
In [193]: sparse.vstack(alist)                                                                       
Out[193]: 
<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>

I tried to make the coo directly from alist, but that didn't trim out the zeros. There's just as much conversion either way, but if the data is sufficiently sparse (5% nonzero or less) this could save quite a bit of memory (if not time).
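A coo can also be built directly from triplet arrays, skipping the zeros during parsing so they are never stored. This is a sketch under the assumption that every string holds the same number of space-separated floats (the strings and column count here are illustrative):

```python
import numpy as np
from scipy import sparse

strings = ['0.05 0.0', '0.2 0.0']  # stand-in for df['a']
n_cols = 2                         # 37 in the real data if every row has 36 spaces

rows, cols, vals = [], [], []
for r, s in enumerate(strings):
    for c, tok in enumerate(s.split()):
        v = float(tok)
        if v != 0.0:               # skip explicit zeros while parsing
            rows.append(r)
            cols.append(c)
            vals.append(v)

# Build the coo from (data, (row, col)) triplets, then convert to CSR.
m = sparse.coo_matrix((vals, (rows, cols)),
                      shape=(len(strings), n_cols)).tocsr()
```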

sparse.vstack combines the data, rows, and cols values from the component matrices to define a new coo matrix. It's the most straightforward way of combining sparse matrices, if not the fastest.

Looks like I could use apply as well:

In [205]: df['a'].apply(foo)                                                                         
Out[205]: 
0      (0, 0)\t0.05
1       (0, 0)\t0.2
Name: a, dtype: object
In [206]: df['a'].apply(foo).values                                                                  
Out[206]: 
array([<1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>,
       <1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>], dtype=object)
In [207]: sparse.vstack(df['a'].apply(foo))                                                          
Out[207]: 
<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>

Upvotes: 1
