Reputation: 35
My original problem was to sort a SciPy sparse coo_matrix
named weights
.
Before sorting it, it looked like:
(0, 1) 2.0
(0, 3) 4.0
(1, 0) 5.0
(3, 3) 1.0
(0, 0) 5.0
(2, 4) 1.0
(1, 2) 2.0
(1, 4) 2.0
(0, 2) 3.0
(0, 4) 1.0
(2, 0) 5.0
(4, 3) 3.0
(3, 4) 3.0
(2, 1) 0.0
(3, 0) 5.0
(3, 2) 0.0
(2, 2) 0.0
(4, 4) 0.0
(4, 0) 0.0
(3, 1) 2.0
I want to have the following (end-)result
(0, 0) 5.0
(0, 1) 2.0
(0, 2) 3.0
(0, 3) 4.0
(0, 4) 1.0
(1, 0) 5.0
(1, 2) 2.0
(1, 4) 2.0
(2, 0) 5.0
(2, 1) 0.0
(2, 2) 0.0
(2, 4) 1.0
(3, 0) 5.0
(3, 1) 2.0
(3, 2) 0.0
(3, 3) 1.0
(3, 4) 3.0
(4, 0) 0.0
(4, 3) 3.0
(4, 4) 0.0
What I tried to do so:
weights_tuples = zip(weights.row, weights.col, weights.data)
sorted_weights_tuples = sorted(train_weights_tuples, key=lambda x: (x[0], x[1]))
This did sort it, but didn't result in the right output format:
[(0, 0, 5.0), (0, 1, 2.0), (0, 3, 4.0), (0, 4, 1.0), (1, 1, 4.0), (1, 2, 2.0), (1, 4, 2.0), (2, 0, 5.0), (2, 1, 0.0), (2, 2, 0.0), (2, 3, 0.0), (3, 0, 5.0), (3, 1, 2.0), (3, 2, 0.0), (3, 3, 1.0), (3, 4, 3.0), (4, 0, 0.0), (4, 1, 0.0), (4, 2, 2.0), (4, 3, 3.0)]
My question is, how to convert the gained result into the right format, or whether there is a better way to sort the coo_matrix
to directly get the right output format.
Thanks in advance.
Upvotes: 0
Views: 669
Reputation: 18201
@hpaulj's comment gives the relevant hint, but here's how to make use of this without relying on implementation details of the methods on coo_matrix
, and without the overhead of converting to CSR:
First, use np.lexsort
to get the permutation giving rise to the right order, then create a new coo_matrix
using the initializer which takes as input the sparse representation:
order = np.lexsort((weights.col, weights.row))
sorted_weights = coo_matrix((weights.data[order], (weights.row[order], weights.col[order])),
shape=weights.shape)
If you don't mind making your code a bit more obscure, you might be able to squeeze out a bit of performance by replacing np.lexsort
with
np.argsort(N * rows + cols)
where N
is the number of rows.
Note that, as @hpaulj also says in his comment, coo_matrix.sum_duplicates
does this inplace, and it sets coo_matrix.has_canonical_format
accordingly to indicate that the rows and columns have been sorted. However, in the implementation of sum_duplicates
, weights.col
and weights.row
have been swapped in the call to np.lexsort
, so using that out of the box would net you the opposite of what you want. This also means that the flag has_canonical_format
in fact does not determine a unique format, which is already noted as a bug on GitHub.
Upvotes: 2