acdr

Reputation: 4706

Efficiently find indices of all values in an array

I have a very large array, consisting of integers between 0 and N, where each value occurs at least once.

I'd like to know, for each value k, all the indices in my array where the array's value equals k.

For example:

arr = np.array([0,1,2,3,2,1,0])
desired_output = {
    0: np.array([0,6]),
    1: np.array([1,5]),
    2: np.array([2,4]),
    3: np.array([3]),
    }

Right now I am accomplishing this with a loop over range(N+1), calling np.where N+1 times.

indices = {}
for value in range(max(arr)+1):
    indices[value] = np.where(arr == value)[0]

This loop is by far the slowest part of my code. (Both the arr==value evaluation and the np.where call take up significant chunks of time.) Is there a more efficient way to do this?

I also tried playing around with np.unique(arr, return_index=True) but that only tells me the very first index, rather than all of them.
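For the example array above it only reports the first occurrence of each value:

np.unique(arr, return_index=True)
# (array([0, 1, 2, 3]), array([0, 1, 2, 3]))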

Upvotes: 4

Views: 195

Answers (4)

Kasravnd

Reputation: 107287

A Pythonic way is to use collections.defaultdict:

>>> from collections import defaultdict
>>> 
>>> d = defaultdict(list)
>>> 
>>> for i, j in enumerate(arr):
...     d[j].append(i)
... 
>>> d
defaultdict(<type 'list'>, {0: [0, 6], 1: [1, 5], 2: [2, 4], 3: [3]})
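If you need NumPy arrays rather than lists (as in the desired output), you can convert the result afterwards:

>>> {k: np.array(v) for k, v in d.items()}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}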

And here is a Numpythonic way using a dictionary comprehension and numpy.where():

>>> {i: np.where(arr == i)[0] for i in np.unique(arr)}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}

And here is a pure Numpythonic approach if you don't want to involve the dictionary:

>>> uniq = np.unique(arr)
>>> args, indices = np.where((np.tile(arr, len(uniq)).reshape(len(uniq), len(arr)) == np.vstack(uniq)))
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]
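The tile/reshape expression builds the same (len(uniq), len(arr)) boolean matrix that plain broadcasting gives you, so an equivalent and lighter spelling of the comparison is:

>>> args, indices = np.where(arr == uniq[:, None])
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]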

Upvotes: 3

Eelco Hoogendoorn

Reputation: 10759

Fully vectorized solution using the numpy_indexed package:

import numpy_indexed as npi
k, idx = npi.group_by(arr, np.arange(len(arr)))  # unique values and, for each, the array of its indices

On a higher level: why do you need these indices? Subsequent grouped operations can usually be computed much more efficiently with the group_by functionality [e.g., npi.group_by(arr).mean(someotherarray)], without explicitly computing the indices of the keys.
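For instance, a minimal sketch of that pattern, assuming someotherarray is just hypothetical per-element data:

import numpy as np
import numpy_indexed as npi

arr = np.array([0, 1, 2, 3, 2, 1, 0])
someotherarray = np.arange(len(arr), dtype=float)  # hypothetical values, one per element of arr
# mean of someotherarray for each distinct value in arr, no index lists needed
unique_keys, group_means = npi.group_by(arr).mean(someotherarray)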

Upvotes: 0

Divakar

Reputation: 221524

Approach #1

Here's a vectorized approach to get those indices as a list of arrays -

sidx = arr.argsort()                                    # indices that would sort arr
unq, cut_idx = np.unique(arr[sidx], return_index=True)  # start of each group in the sorted order
indices = np.split(sidx, cut_idx)[1:]                   # drop the empty leading split (cut_idx[0] == 0)

If you want a final dictionary that maps each unique element to its indices, we can use a dictionary comprehension -

dict_out = {unq[i]: iterID for i, iterID in enumerate(indices)}
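Equivalently, since np.split returns the groups in the same order as unq:

dict_out = dict(zip(unq, indices))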

Approach #2

If you are just interested in the list of arrays, here's an alternative meant for performance -

sidx = arr.argsort()
# arr[sidx] is non-decreasing, so diff > 0 marks the group boundaries
indices = np.split(sidx, np.flatnonzero(np.diff(arr[sidx]) > 0) + 1)

Upvotes: 7

Daniel Roseman

Reputation: 599490

I don't know numpy, but you could definitely do this in one iteration with a defaultdict:

from collections import defaultdict

indices = defaultdict(list)
for i, val in enumerate(arr):
    indices[val].append(i)

Upvotes: 1
