Convert list of tuples to sparse coo_matrix or sort it directly

Question

My original problem was to sort a SciPy sparse coo_matrix named weights.

Before sorting it, it looked like:

  (0, 1)    2.0
  (0, 3)    4.0
  (1, 0)    5.0
  (3, 3)    1.0
  (0, 0)    5.0
  (2, 4)    1.0
  (1, 2)    2.0
  (1, 4)    2.0
  (0, 2)    3.0
  (0, 4)    1.0
  (2, 0)    5.0
  (4, 3)    3.0
  (3, 4)    3.0
  (2, 1)    0.0
  (3, 0)    5.0
  (3, 2)    0.0
  (2, 2)    0.0
  (4, 4)    0.0
  (4, 0)    0.0
  (3, 1)    2.0

I want to have the following (end-)result

  (0, 0)    5.0
  (0, 1)    2.0
  (0, 2)    3.0
  (0, 3)    4.0
  (0, 4)    1.0
  (1, 0)    5.0
  (1, 2)    2.0
  (1, 4)    2.0
  (2, 0)    5.0
  (2, 1)    0.0
  (2, 2)    0.0
  (2, 4)    1.0
  (3, 0)    5.0  
  (3, 1)    2.0
  (3, 2)    0.0
  (3, 3)    1.0 
  (3, 4)    3.0
  (4, 0)    0.0
  (4, 3)    3.0
  (4, 4)    0.0

What I tried to do so:

weights_tuples = zip(weights.row, weights.col, weights.data)
sorted_weights_tuples = sorted(train_weights_tuples, key=lambda x: (x[0], x[1]))

This did sort it, but didn't result in the right output format:

[(0, 0, 5.0), (0, 1, 2.0), (0, 3, 4.0), (0, 4, 1.0), (1, 1, 4.0), (1, 2, 2.0), (1, 4, 2.0), (2, 0, 5.0), (2, 1, 0.0), (2, 2, 0.0), (2, 3, 0.0), (3, 0, 5.0), (3, 1, 2.0), (3, 2, 0.0), (3, 3, 1.0), (3, 4, 3.0), (4, 0, 0.0), (4, 1, 0.0), (4, 2, 2.0), (4, 3, 3.0)]

My question is, how to convert the gained result into the right format, or whether there is a better way to sort the coo_matrix to directly get the right output format.

Thanks in advance.

fuglede · Accepted Answer

@hpaulj's comment gives the relevant hint, but here's how to make use of this without relying on implementation details of the methods on coo_matrix, and without the overhead of converting to CSR:

First, use np.lexsort to get the permutation giving rise to the right order, then create a new coo_matrix using the initializer which takes as input the sparse representation:

order = np.lexsort((weights.col, weights.row))
sorted_weights = coo_matrix((weights.data[order], (weights.row[order], weights.col[order])),
                            shape=weights.shape)

If you don't mind making your code a bit more obscure, you might be able to squeeze out a bit of performance by replacing np.lexsort with

np.argsort(N * rows + cols)

where N is the number of rows.

Note that, as @hpaulj also says in his comment, coo_matrix.sum_duplicates does this inplace, and it sets coo_matrix.has_canonical_format accordingly to indicate that the rows and columns have been sorted. However, in the implementation of sum_duplicates, weights.col and weights.row have been swapped in the call to np.lexsort, so using that out of the box would net you the opposite of what you want. This also means that the flag has_canonical_format in fact does not determine a unique format, which is already noted as a bug on GitHub.

Convert list of tuples to sparse coo_matrix or sort it directly

Answers (1)

Related Questions