Maxim Imakaev

Reputation: 1545

Indexing numpy record arrays is very slow

It looks like indexing numpy record arrays with an array of indices is outrageously slow. However, the same operation performed through a .view of the array as a plain string dtype is 10-15 times faster.

Is there a reason behind this difference? Why isn't indexing of record arrays implemented in a faster way? (see also sorting numpy structured and record arrays is very slow)

import numpy as np

mydtype = np.dtype("i4,i8")
mydtype.names = ("foo", "bar")
N = 100000

foobar = np.zeros(N, dtype=mydtype)
foobar["foo"] = np.random.randint(0, 100, N)
foobar["bar"] = np.random.randint(0, 10000, N)

b = np.lexsort((foobar["foo"],foobar["bar"]))

timeit foobar[b]
100 loops, best of 3: 11.2 ms per loop

timeit foobar.view("|S12")[b].view(mydtype)
1000 loops, best of 3: 882 µs per loop

Obviously, both results give the same answer.
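
For reference, the equivalence can be checked field by field (a quick sketch, reusing the arrays above):

# Compare the direct fancy-indexed result with the view-based one
direct = foobar[b]
viewed = foobar.view("|S12")[b].view(mydtype)
assert all(np.array_equal(direct[name], viewed[name]) for name in mydtype.names)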

Upvotes: 4

Views: 995

Answers (1)

hpaulj

Reputation: 231375

np.take, as mentioned in https://stackoverflow.com/a/23303357/901925, is even faster than your double-view approach:

np.take(foobar, b)

In fact, it's as fast as indexing a single field:

foobar['foo'][b]
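
A quick way to compare the approaches yourself (a sketch; absolute timings will vary by machine and NumPy version):

timeit np.take(foobar, b)
timeit foobar["foo"][b]
timeit foobar.view("|S12")[b].view(mydtype)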

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c is a starting point if you want to dig further into the source code.

My guess is that something in how __getitem__ is implemented causes this difference. Perhaps, as a remnant of earlier record processing, advanced indexing takes a different (slower) path when the dtype is compound.

Boolean mask indexing doesn't seem to be affected by this slowdown, and neither is basic slice indexing.
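
A minimal sketch of those unaffected cases, on the same array (try the timings yourself):

mask = foobar["foo"] > 50
timeit foobar[mask]     # boolean mask: no comparable penalty
timeit foobar[10:5000]  # basic slice: returns a view, essentially free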

Upvotes: 3
