How to identify significant items from a correlation matrix in Python (without inner loop)

Question

I have built a correlation matrix from a small test set and ended up with the following. True values are those greater than a defined threshold (e.g. results = correlation_matrix > 0.75):

[[False False False  True]
 [False False  True False]
 [False  True False  True]
 [ True False  True False]]

Note that I also set the diagonal (top left to bottom right) to False. I only need half of the matrix, because it is symmetric about that diagonal.

Is there a function in NumPy (or another library) that returns the row/column indices of the True values? When I run this against real data (200k rows), it needs to be fast, without an inner loop: 200k*200k checks would be very slow. I imagine NumPy, scikit-learn, or a similar library provides this, but I have not been able to find it.

The expected output from this would be:

[[1, 4], [2, 3], [3, 2], [3, 4], [4, 1], [4, 3]]

Ideally, given that the matrix is a mirror image, it would be:

[[1, 4], [2, 3], [3, 4]]

Divakar · Accepted Answer

To get the indices with 0-based indexing, one straightforward way is to mask out the lower triangle with np.triu and then get the indices with np.argwhere -

np.argwhere(np.triu(a))

To mask out the diagonal as well, use np.triu(a, 1).
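
As a minimal sketch, here is that approach applied to the 4x4 boolean matrix from the question. The names pairs0 and pairs1 are only illustrative, and the + 1 is added solely to match the 1-based row/column numbering used in the question's expected output.

```python
import numpy as np

# Thresholded correlation matrix from the question (diagonal already False).
a = np.array([[False, False, False, True],
              [False, False, True,  False],
              [False, True,  False, True],
              [True,  False, True,  False]])

# Keep only entries strictly above the diagonal, then collect their (row, col) indices.
pairs0 = np.argwhere(np.triu(a, 1))   # 0-based indices

# Shift to 1-based numbering, as used in the question.
pairs1 = pairs0 + 1

print(pairs1.tolist())  # [[1, 4], [2, 3], [3, 4]]
```

Each pair is reported once, so the mirrored duplicates such as [4, 1] do not appear.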

Another way would be to use an explicit mask created with the help of broadcasting -

import numpy as np

r = np.arange(a.shape[0])
a[r[:, None] >= r] = 0  # clears the diagonal and lower triangle; modifies a in place
indices = np.argwhere(a)
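
If the input array must stay untouched, one possible variant (a sketch, not part of the original answer) builds the same broadcast mask but combines it with the array instead of assigning into it. It assumes a is a boolean array, as in the question; the name upper is only illustrative.

```python
import numpy as np

# Sample boolean matrix from the question.
a = np.array([[False, False, False, True],
              [False, False, True,  False],
              [False, True,  False, True],
              [True,  False, True,  False]])
original = a.copy()  # kept only to show that a is not modified

r = np.arange(a.shape[0])
upper = r[:, None] < r             # True strictly above the diagonal
indices = np.argwhere(a & upper)   # assumes a is boolean; a itself is left unchanged

print(indices.tolist())  # [[0, 3], [1, 2], [2, 3]]
```

The result is 0-based, the same as with np.triu(a, 1).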
