Reputation: 2115
I have a numpy array like this: [1 2 2 0 0 1 3 5]
Is it possible to get the indices of the elements as a 2d array? For instance, the answer for the above input would be [[3 4], [0 5], [1 2], [6], [], [7]].
Currently I have to loop over the different values and call numpy.where(input == i) for each value, which has terrible performance with a big enough input.
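For reference, the current approach looks roughly like this (a sketch of the slow loop described above):
import numpy as np

input = np.array([1, 2, 2, 0, 0, 1, 3, 5])
# One numpy.where call per distinct value: simple, but it rescans the whole array each time.
indices = [np.where(input == i)[0].tolist() for i in range(input.max() + 1)]
# indices == [[3, 4], [0, 5], [1, 2], [6], [], [7]]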
Upvotes: 11
Views: 3757
Reputation: 480
So given a list of elements, you want to group their indices by value. In linear time, this could be done as:
mylist = [1, 2, 2, 0, 0, 1, 3, 5]   # example input from the question
hashtable = dict()
for idx, val in enumerate(mylist):
    if val not in hashtable:
        hashtable[val] = list()
    hashtable[val].append(idx)
# Emit the index lists in value order, with [] for values that never occur.
newlist = [hashtable.get(val, []) for val in range(max(hashtable) + 1)]
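With the example input from the question, the snippet above yields:
print(newlist)
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]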
This should take O(n) time (plus a cheap pass over the range of values to emit the groups in order). I cannot think of a faster solution as of now, but will update here if I do.
Upvotes: 0
Reputation: 136
This gives you exactly what you want and takes about 2.5 seconds for 10,000,000 elements on my machine:
import numpy as np
import timeit

# x = np.array("1 2 2 0 0 1 3 5".split(), int)
x = np.random.randint(0, 100, 100000)

def create_index_list(x):
    d = {}
    max_value = -1
    for i, v in enumerate(x):
        if v > max_value:
            max_value = v
        try:
            d[v].append(i)
        except KeyError:
            d[v] = [i]
    result_list = []
    for i in range(max_value + 1):
        if i in d:
            result_list.append(d[i])
        else:
            result_list.append([])
    return result_list

# print(create_index_list(x))
print(timeit.timeit(stmt='create_index_list(x)', number=1, globals=globals()))
Upvotes: 1
Reputation: 53109
Here is an O(max(x)+len(x)) approach using scipy.sparse:
import numpy as np
from scipy import sparse
x = np.array("1 2 2 0 0 1 3 5".split(),int)
x
# array([1, 2, 2, 0, 0, 1, 3, 5])
M,N = x.max()+1,x.size
sparse.csc_matrix((x,x,np.arange(N+1)),(M,N)).tolil().rows.tolist()
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]
This works by creating a sparse matrix with entries at positions (x[0],0), (x[1],1), ... Using the CSC (compressed sparse column) format this is rather simple. The matrix is then converted to LIL (linked list) format. This format stores the column indices for each row as a list in its rows attribute, so all we need to do is take that and convert it to a list.
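To make the construction concrete, here is the same one-liner spelled out, showing which CSC array plays which role for the example input (a restatement of the code above, not a different method):
import numpy as np
from scipy import sparse

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
M, N = x.max() + 1, x.size       # 6 rows (one per possible value), 8 columns (one per element)

data = x                         # stored values; reused here since only the positions end up in .rows
row_indices = x                  # column j gets its single entry in row x[j]
indptr = np.arange(N + 1)        # column j's entries occupy slice [j, j+1): exactly one per column

m = sparse.csc_matrix((data, row_indices, indptr), (M, N))
print(m.tolil().rows.tolist())
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]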
Note that for small arrays argsort-based solutions are probably faster, but at some not insanely large size this will cross over.
EDIT: argsort-based numpy-only solution:
np.split(x.argsort(kind="stable"),np.bincount(x)[:-1].cumsum())
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]
If the order of indices within groups doesn't matter, you can also try argpartition (it happens to make no difference in this small example, but this is not guaranteed in general):
bb = np.bincount(x)[:-1].cumsum()
np.split(x.argpartition(bb),bb)
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]
EDIT: @Divakar recommends against the use of np.split. Instead, a loop is probably faster:
A = x.argsort(kind="stable")
B = np.bincount(x+1).cumsum()
[A[B[i-1]:B[i]] for i in range(1,len(B))]
Or you could use the brand new (Python 3.8+) walrus operator:
A = x.argsort(kind="stable")
B = np.bincount(x)
L = 0
[A[L:(L:=L+b)] for b in B.tolist()]
EDIT(EDITED):
(Not pure numpy): As an alternative to numba (see @senderle's post) we can also use pythran.
Compile with pythran -O3 <filename.py>
import numpy as np

#pythran export sort_to_bins(int[:], int)

def sort_to_bins(idx, mx):
    if mx == -1:
        mx = idx.max() + 1
    # Count occurrences of each value (offset by 2 so the final slicing works out).
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    # Prefix sums turn the counts into bin start offsets.
    for i in range(3, cnts.size):
        cnts[i] += cnts[i - 1]
    # Scatter positions into their bins, counting-sort style.
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i] + 1]] = i
        cnts[idx[i] + 1] += 1
    return [res[cnts[i]:cnts[i + 1]] for i in range(mx)]
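After compiling, the function is imported from the generated module and called directly. A minimal usage sketch (the module name below is hypothetical; it is whatever you used for <filename.py>):
import numpy as np
# Hypothetical module name; replace with the name of your compiled file.
from sort_to_bins_pythran import sort_to_bins

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print(sort_to_bins(x, -1))   # mx=-1 lets the function derive the number of bins itself
# e.g. [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]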
Here numba wins by a whisker performance-wise:
repeat(lambda:enum_bins_numba_buffer(x),number=10)
# [0.6235917090671137, 0.6071486569708213, 0.6096088469494134]
repeat(lambda:sort_to_bins(x,-1),number=10)
# [0.6235359431011602, 0.6264424560358748, 0.6217901279451326]
Older stuff:
import numpy as np

#pythran export bincollect(int[:])

def bincollect(a):
    o = [[] for _ in range(a.max() + 1)]
    for i, j in enumerate(a):
        o[j].append(i)
    return o
Timings vs. numba (old)
timeit(lambda:bincollect(x),number=10)
# 3.5732191529823467
timeit(lambda:enumerate_bins(x),number=10)
# 6.7462647299980745
Upvotes: 11
Reputation: 151177
Although the request is for a numpy solution, I decided to see whether there is an interesting numba-based solution. And indeed there is! Here's an approach that represents the partitioned list as a ragged array stored in a single preallocated buffer. This takes some inspiration from the argsort approach proposed by Paul Panzer. (For an older version that didn't do as well, but was simpler, see below.)
import numba
import numpy as np

@numba.jit(numba.void(numba.int64[:],
                      numba.int64[:],
                      numba.int64[:]),
           nopython=True)
def enum_bins_numba_buffer_inner(ints, bins, starts):
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] += 1

@numba.jit(nopython=False)  # Not 100% sure this does anything...
def enum_bins_numba_buffer(ints):
    ends = np.bincount(ints).cumsum()
    starts = np.empty(ends.shape, dtype=np.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = np.empty(ints.shape, dtype=np.int64)
    enum_bins_numba_buffer_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]
This processes a ten-million item list in 75ms, which is nearly a 50x speedup from a list-based version written in pure Python.
For a slower but somewhat more readable version, here's what I had before, based on recently added experimental support for dynamically sized "typed lists," which allow us to fill up each bin in an out-of-order way much more quickly.
This wrestles with numba's type inference engine a bit, and I'm sure there's a better way to handle that part. This also turns out to be almost 10x slower than the above.
@numba.jit(nopython=True)
def enum_bins_numba(ints):
    bins = numba.typed.List()
    for i in range(ints.max() + 1):
        inner = numba.typed.List()
        inner.append(0)  # An awkward way of forcing type inference.
        inner.pop()
        bins.append(inner)

    for x, i in enumerate(ints):
        bins[i].append(x)

    return bins
I tested these against the following:
from collections import defaultdict
from scipy import sparse

def enum_bins_dict(ints):
    enum_bins = defaultdict(list)
    for k, v in enumerate(ints):
        enum_bins[v].append(k)
    return enum_bins

def enum_bins_list(ints):
    enum_bins = [[] for i in range(ints.max() + 1)]
    for x, i in enumerate(ints):
        enum_bins[i].append(x)
    return enum_bins

def enum_bins_sparse(ints):
    M, N = ints.max() + 1, ints.size
    return sparse.csc_matrix((ints, ints, np.arange(N + 1)),
                             (M, N)).tolil().rows.tolist()
I also tested them against a precompiled cython version similar to enum_bins_numba_buffer (described in detail below).
On a list of ten million random ints (ints = np.random.randint(0, 100, 10000000)) I get the following results:
enum_bins_dict(ints)
3.71 s ± 80.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_list(ints)
3.28 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_sparse(ints)
1.02 s ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_numba(ints)
693 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_cython(ints)
82.3 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
enum_bins_numba_buffer(ints)
77.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Impressively, this way of working with numba outperforms a cython version of the same function, even with bounds-checking turned off. I don't yet have enough familiarity with pythran to test this approach using it, but I'd be interested to see a comparison. It seems likely based on this speedup that the pythran version might also be quite a bit faster with this approach.
Here's the cython version for reference, with some build instructions. Once you have cython installed, you'll need a simple setup.py file like this:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        'enum_bins_cython',
        ['enum_bins_cython.pyx'],
    )
]

setup(
    ext_modules=cythonize(ext_modules),
    include_dirs=[numpy.get_include()]
)
And the cython module, enum_bins_cython.pyx:
# cython: language_level=3

import cython
import numpy
cimport numpy

@cython.boundscheck(False)
@cython.cdivision(True)
@cython.wraparound(False)
cdef void enum_bins_inner(long[:] ints, long[:] bins, long[:] starts) nogil:
    cdef long i, x
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] = starts[i] + 1

def enum_bins_cython(ints):
    assert (ints >= 0).all()
    # There might be a way to avoid storing two offset arrays and
    # save memory, but `enum_bins_inner` modifies the input, and
    # having separate lists of starts and ends is convenient for
    # the final partition stage.
    ends = numpy.bincount(ints).cumsum()
    starts = numpy.empty(ends.shape, dtype=numpy.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = numpy.empty(ints.shape, dtype=numpy.int64)
    enum_bins_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]
With these two files in your working directory, run this command:
python setup.py build_ext --inplace
You can then import the function using from enum_bins_cython import enum_bins_cython.
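A quick usage check might look like this (a sketch; it assumes the build above succeeded and that your platform's default integer dtype matches the C long in the .pyx signature):
import numpy as np
from enum_bins_cython import enum_bins_cython

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print(enum_bins_cython(x))
# e.g. [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]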
Upvotes: 7
Reputation: 14399
Here's a really really weird way to do this that's terrible, but I found it too funny to not share - and all numpy!
out = np.array([''] * (x.max() + 1), dtype = object)
np.add.at(out, x, ["{} ".format(i) for i in range(x.size)])
[[int(i) for i in o.split()] for o in out]
Out[]:
[[3, 4], [0, 5], [1, 2], [6], [], [7]]
EDIT: this is the best method I could find along this path. It's still 10x slower than @PaulPanzer's argsort solution:
out = np.empty((x.max() + 1), dtype = object)
out[:] = [[]] * (x.max() + 1)
coords = np.empty(x.size, dtype = object)
coords[:] = [[i] for i in range(x.size)]
np.add.at(out, x, coords)
list(out)
Upvotes: 6
Reputation: 3401
You can do it by making a dictionary of numbers: the keys are the numbers and the values are the indices at which each number is seen. This is one of the fastest ways to do it; you can see the code below:
>>> import numpy as np
>>> a = np.array([1, 2, 2, 0, 0, 1, 3, 5])
>>> b = {}
# Creating an empty list for each number that can occur in array a
>>> for i in range(np.min(a), np.max(a) + 1):
...     b[str(i)] = []
...
# Adding indices to the corresponding key
>>> for i in range(len(a)):
...     b[str(a[i])].append(i)
...
# Resulting dictionary
>>> b
{'0': [3, 4], '1': [0, 5], '2': [1, 2], '3': [6], '4': [], '5': [7]}
# Printing the result in the way you wanted (key=int keeps numeric order for values >= 10).
>>> for i in sorted(b.keys(), key=int):
...     print(b[i], end=" ")
...
[3, 4] [0, 5] [1, 2] [6] [] [7]
Upvotes: 2
Reputation: 508
Pseudocode:
1. Get the number of 1d arrays in the 2d array by subtracting the minimum value of your numpy array from the maximum value and adding one. In your case it will be 5 - 0 + 1 = 6.
2. Initialize a 2d array with that many 1d arrays in it. In your case, initialize a 2d array with 6 1d arrays. Each 1d array corresponds to a unique element in your numpy array; for example, the first 1d array will correspond to '0', the second 1d array will correspond to '1', ...
3. Loop through your numpy array and put the index of each element into the corresponding 1d array. In your case, the index of the first element in your numpy array will go into the second 1d array, the index of the second element will go into the third 1d array, ...
This pseudocode takes linear time to run, as it only depends on the length of your numpy array; a sketch follows below.
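A minimal sketch of these steps in plain Python (names are illustrative, not from the answer):
import numpy as np

def group_indices(a):
    # Step 1: number of inner lists = max - min + 1 (here 5 - 0 + 1 = 6).
    lo, hi = a.min(), a.max()
    n_groups = hi - lo + 1

    # Step 2: one empty list per possible value.
    groups = [[] for _ in range(n_groups)]

    # Step 3: single pass, appending each index to its value's list.
    for idx, val in enumerate(a):
        groups[val - lo].append(idx)
    return groups

print(group_indices(np.array([1, 2, 2, 0, 0, 1, 3, 5])))
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]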
Upvotes: 1
Reputation: 14857
One potential option depending on the size of your data is to just drop out of numpy and use collections.defaultdict:
In [248]: from collections import defaultdict
In [249]: d = defaultdict(list)
In [250]: l = np.random.randint(0, 100, 100000)
In [251]: %%timeit
...: for k, v in enumerate(l):
...: d[v].append(k)
...:
10 loops, best of 3: 22.8 ms per loop
Then you end up with a dictionary of {value1: [index1, index2, ...], value2: [index3, index4, ...]}. The time scaling is pretty close to linear with the size of the array, so 10,000,000 items takes ~2.7s on my machine, which seems reasonable enough.
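If you then need the list-of-lists form from the question, one extra pass over the value range gives it (a small follow-up sketch, not part of the timing above):
# defaultdict(list) returns [] for values that never occur, so gaps are handled for free.
result = [d[v] for v in range(l.max() + 1)]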
Upvotes: 8