infiniteloop

Reputation: 2212

Convert a sparse matrix to dataframe

I have a sparse matrix that stores computed similarities between a set of documents. The matrix is an ndarray.

         0         1         2         3         4          
0        1.000000  0.000000  0.000000  0.000000  0.000000  
1        0.000000  1.000000  0.067279  0.000000  0.000000  
2        0.000000  0.067279  1.000000  0.025758  0.012039  
3        0.000000  0.000000  0.025758  1.000000  0.000000  
4        0.000000  0.000000  0.012039  0.000000  1.000000  

I would like to transform this data into a three-column dataframe as follows.

docA docB similarity
1    2    0.067279
2    3    0.025758
2    4    0.012039

This final result does not contain matrix diagonals or zero values. It also lists each document pair only once (i.e. in one row only). Is there a built-in / efficient method to achieve this end result? Any pointers would be much appreciated.

Thanks!

Upvotes: 1

Answers (1)

Mateen Ulhaq

Reputation: 27271

If the matrix is held in a dataframe, convert it to an array first (skip this step if it is already an ndarray):

x = df.to_numpy()

Collect the non-zero entries above the diagonal of the symmetric similarity matrix:

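import numpy as np
# Row and column indices strictly above the diagonal (k=1 skips the diagonal)
i, j = np.triu_indices_from(x, k=1)
v = x[i, j]
# Stack into an (n, 3) array of (row, col, value) triplets
ijv = np.column_stack((i, j, v))
# Drop pairs whose similarity is zero
ijv = ijv[v != 0.0]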

Convert it back to a dataframe:

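df_ijv = pd.DataFrame(ijv, columns=["docA", "docB", "similarity"])
df_ijv = df_ijv.astype({"docA": int, "docB": int})  # indices were stored as floats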
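
Putting the three steps together, here is a minimal end-to-end sketch run on the matrix from the question. The function name similarity_pairs is only an illustrative choice, and the column names are taken from the desired output in the question. Building the dataframe from separate columns keeps the index columns as integers.

```python
import numpy as np
import pandas as pd


def similarity_pairs(x):
    """Return one row per non-zero pair above the diagonal of a square matrix."""
    x = np.asarray(x)
    # Indices strictly above the diagonal: each pair appears once, no self-pairs
    i, j = np.triu_indices_from(x, k=1)
    v = x[i, j]
    keep = v != 0  # drop pairs with zero similarity
    return pd.DataFrame({"docA": i[keep], "docB": j[keep], "similarity": v[keep]})


# The 5x5 similarity matrix from the question
x = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.067279, 0.0, 0.0],
    [0.0, 0.067279, 1.0, 0.025758, 0.012039],
    [0.0, 0.0, 0.025758, 1.0, 0.0],
    [0.0, 0.0, 0.012039, 0.0, 1.0],
])

result = similarity_pairs(x)
print(result)
```

This should give the three rows requested in the question: (1, 2), (2, 3) and (2, 4).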

I'm not sure whether this is any faster, but an alternative way to do the middle step is to convert the numpy array to an ijv ("triplet") sparse matrix:

from scipy import sparse
coo = sparse.coo_matrix(x)
ijv = np.concatenate((coo.row, coo.col, coo.data)).reshape(3, -1).T

Since the matrix is symmetric, you then only need to keep the non-zero elements in the upper triangle, excluding the diagonal (i.e. the triplets where row < col). You could loop through the triplets, or you could pre-mask the array with np.triu_indices_from(x, k=1), though that rather defeats the purpose of this supposedly faster method.
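
One way to avoid both the loop and the dense pre-mask might be to let scipy extract the upper triangle. The sketch below assumes scipy.sparse.triu with k=1 is acceptable here. The variable names upper and pairs are illustrative, and I have not benchmarked it against the dense version.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# The 5x5 similarity matrix from the question
x = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.067279, 0.0, 0.0],
    [0.0, 0.067279, 1.0, 0.025758, 0.012039],
    [0.0, 0.0, 0.025758, 1.0, 0.0],
    [0.0, 0.0, 0.012039, 0.0, 1.0],
])

# Converting a dense array to COO keeps only the non-zero entries;
# triu with k=1 then discards the diagonal and the lower triangle.
upper = sparse.triu(sparse.coo_matrix(x), k=1).tocoo()

pairs = pd.DataFrame(
    {"docA": upper.row, "docB": upper.col, "similarity": upper.data}
)
# Sort explicitly rather than relying on the storage order of the triplets
pairs = pairs.sort_values(["docA", "docB"]).reset_index(drop=True)
print(pairs)
```

If the similarities are already stored in a scipy sparse matrix, the same triu call should apply to it directly, without going through a dense array.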

Upvotes: 2
