learner

Reputation: 877

Convert pandas single column to Scipy Sparse Matrix

I have a pandas data frame like this:

     a                           other-columns
   0.3 0.2 0.0 0.0 0.0...        ....

I want to convert column a into a SciPy sparse CSR matrix. a is a probability distribution. I would like to convert it without expanding a into multiple columns.

Here is a naive solution that does expand a into multiple columns:

  df = df.join(df['a'].str.split(expand=True).add_prefix('a')).drop(['a'], axis=1)
  df_matrix = scipy.sparse.csr_matrix(df.values)

But I don't want to expand into multiple columns, since that blows up memory usage. Is it possible to do this while keeping a in a single column?

EDIT (Minimum Reproducible Example):

 import pandas as pd
 from scipy.sparse import csr_matrix
 d = {'a': ['0.05 0.0', '0.2 0.0']}
 df = pd.DataFrame(data=d)
 df = df.join(df['a'].str.split(expand=True).add_prefix('a')).drop(['a'], axis=1)
 df = df.astype(float)
 df_matrix = csr_matrix(df.values)
 df_matrix

Output:

 <2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>

I want to achieve the above, but without splitting into multiple columns. Also, in my real file, the a column holds space-separated strings (every row contains exactly 36 spaces) and there are millions of rows.

Upvotes: 0

Views: 927

Answers (2)

CJR

Reputation: 3985

Also, in my real file, I have 36 length string (separated by space) columns and millions of rows. It is sure that all rows will contain 36 spaces.

Convert large csv to sparse matrix for use in sklearn

I cannot overstate how much you should not do the thing that follows this sentence.

import pandas as pd
import numpy as np
from scipy import sparse

df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0'] * 100000})
chunksize = 10000

sparse_coo = []
for i in range(int(np.ceil(df.shape[0] / chunksize))):
    chunk = df.iloc[i * chunksize:min((i + 1) * chunksize, df.shape[0]), :]
    sparse_coo.append(sparse.coo_matrix(chunk['a'].apply(lambda x: [float(y) for y in x.split()]).tolist()))

sparse_coo = sparse.vstack(sparse_coo)
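Since the question asked for CSR specifically, the stacked COO result can be converted once at the end with `tocsr()`. A minimal sketch with toy data standing in for the chunked pieces (not part of the original answer):

```python
import numpy as np
from scipy import sparse

# Two per-chunk COO pieces, as produced inside the loop above.
pieces = [sparse.coo_matrix(np.array([[0.05, 0.0]])),
          sparse.coo_matrix(np.array([[0.2, 0.0]]))]

# Stack all pieces, then convert once to CSR at the very end;
# converting once is cheaper than converting each chunk.
m = sparse.vstack(pieces).tocsr()
print(m.shape, m.format)  # (2, 2) csr
```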

Upvotes: 1

hpaulj

Reputation: 231738

You could get the dense array from the column without the expand:

In [179]: df = pd.DataFrame(data=d)                                                                  

e.g.

In [180]: np.array(df['a'].str.split().tolist(),float)                                               
Out[180]: 
array([[0.05, 0.  ],
       [0.2 , 0.  ]])

But I doubt that saves much memory (though I only have a crude understanding of DataFrame memory use).

You could convert each string to a sparse matrix:

In [190]: def foo(astr): 
     ...:     alist = astr.split() 
     ...:     arr = np.array(alist, float) 
     ...:     return sparse.coo_matrix(arr) 
                                                                                               
In [191]: alist = [foo(row) for row in df['a']]                                                      
In [192]: alist                                                                                      
Out[192]: 
[<1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>,
 <1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>]
In [193]: sparse.vstack(alist)                                                                       
Out[193]: 
<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>

I tried to make the coo directly from alist, but that didn't trim out the zeros. There's just as much conversion either way, but if the data is sufficiently sparse (5% nonzero or less) this could save quite a bit of memory (if not time).
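A coo can also be built directly from triplet arrays, skipping the zeros during parsing so they are never stored. This is a sketch under the assumption that every string holds the same number of space-separated floats (the strings and column count here are illustrative):

```python
import numpy as np
from scipy import sparse

strings = ['0.05 0.0', '0.2 0.0']  # stand-in for df['a']
n_cols = 2                         # 37 in the real data if every row has 36 spaces

rows, cols, vals = [], [], []
for r, s in enumerate(strings):
    for c, tok in enumerate(s.split()):
        v = float(tok)
        if v != 0.0:               # skip explicit zeros while parsing
            rows.append(r)
            cols.append(c)
            vals.append(v)

# Build the coo from (data, (row, col)) triplets, then convert to CSR.
m = sparse.coo_matrix((vals, (rows, cols)),
                      shape=(len(strings), n_cols)).tocsr()
```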

sparse.vstack combines the data, rows, and cols values from the component matrices to define a new coo matrix. It's the most straightforward way of combining sparse matrices, if not the fastest.

Looks like I could use apply as well:

In [205]: df['a'].apply(foo)                                                                         
Out[205]: 
0      (0, 0)\t0.05
1       (0, 0)\t0.2
Name: a, dtype: object
In [206]: df['a'].apply(foo).values                                                                  
Out[206]: 
array([<1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>,
       <1x2 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>], dtype=object)
In [207]: sparse.vstack(df['a'].apply(foo))                                                          
Out[207]: 
<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>

Upvotes: 1
