goclem

Reputation: 954

Sum rows of a 2D array by index

I want to sum the rows of a 2D array dat, grouped by the labels in idx (one label per row). The following example works but is slow for large arrays. Any ideas for speeding it up?

import numpy as np

dat = np.arange(18).reshape(6, 3, order='F')
idx = np.array([0, 1, 1, 1, 2, 2])

# For each distinct label, sum the rows of dat carrying that label
for i in np.unique(idx):
    print(np.sum(dat[idx == i], axis=0))

Output

[ 0  6 12]
[ 6 24 42]
[ 9 21 33]

Upvotes: 1

Views: 55

Answers (2)

Divakar

Reputation: 221584

Approach #1

We can leverage matrix multiplication with np.dot: build a boolean mask with one column per unique label, then multiply its transpose by dat -

In [56]: mask = idx[:,None] == np.unique(idx)

In [57]: mask.T.dot(dat)
Out[57]: 
array([[ 0,  6, 12],
       [ 6, 24, 42],
       [ 9, 21, 33]])
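
For reference, a self-contained version of the approach above, as a minimal sketch. The function name group_sum_dot is a hypothetical helper and not part of the original answer. Note that the mask has shape (n_rows, n_groups), so its memory use grows with the number of distinct labels.

```python
import numpy as np

def group_sum_dot(dat, idx):
    """Sum the rows of dat grouped by idx via a one-hot mask (hypothetical helper)."""
    labels = np.unique(idx)            # sorted distinct labels
    mask = idx[:, None] == labels      # boolean, shape (n_rows, n_groups)
    # Cast the mask to dat's dtype, then multiply: row g of the result
    # is the sum of all rows of dat whose label equals labels[g].
    return mask.T.astype(dat.dtype).dot(dat)

dat = np.arange(18).reshape(6, 3, order='F')
idx = np.array([0, 1, 1, 1, 2, 2])
result = group_sum_dot(dat, idx)
print(result)
```

The rows of the result follow the sorted order of the labels returned by np.unique.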

Approach #2

If idx is already sorted, we can use np.add.reduceat, where p holds the start position of each run of equal labels -

In [52]: p = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])

In [53]: np.add.reduceat(dat, p, axis=0)
Out[53]: 
array([[ 0,  6, 12],
       [ 6, 24, 42],
       [ 9, 21, 33]])

Upvotes: 2

RomanPerekhrest

Reputation: 92854

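A slightly faster approach using a set object and the ndarray.sum() method (note that a set does not guarantee iteration order, so the groups are not necessarily visited in sorted order):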

In [216]: for i in set(idx):
     ...:     print(dat[idx == i].sum(axis=0))
     ...:     
[ 0  6 12]
[ 6 24 42]
[ 9 21 33]

Execution time comparison:

In [217]: %timeit for i in np.unique(idx): r = np.sum(dat[idx==i], axis = 0)
109 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [218]: %timeit for i in set(idx): r = dat[idx == i].sum(axis=0)
71.1 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
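
The timings above were taken on the 6x3 example, where per-call overhead likely dominates. A rough sketch for repeating the comparison on a larger array follows; the array sizes, the random data, and the helper names loop_sum, dot_sum and reduceat_sum are all assumptions chosen for illustration, and no timings are claimed here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, n_groups = 20_000, 3, 50          # assumed sizes
dat_big = rng.integers(0, 10, size=(n_rows, n_cols))
idx_big = np.sort(rng.integers(0, n_groups, size=n_rows))  # sorted labels

def loop_sum(dat, idx):
    # Python loop over the distinct labels, as in the question
    return np.array([dat[idx == i].sum(axis=0) for i in np.unique(idx)])

def dot_sum(dat, idx):
    # One-hot mask followed by a matrix product
    mask = idx[:, None] == np.unique(idx)
    return mask.T.astype(dat.dtype).dot(dat)

def reduceat_sum(dat, idx):
    # Requires idx to be sorted
    starts = np.flatnonzero(np.r_[True, idx[:-1] != idx[1:]])
    return np.add.reduceat(dat, starts, axis=0)

# Check that all three variants agree before timing them
loop_result = loop_sum(dat_big, idx_big)
dot_result = dot_sum(dat_big, idx_big)
reduceat_result = reduceat_sum(dat_big, idx_big)

for name, fn in [("loop", loop_sum), ("dot", dot_sum), ("reduceat", reduceat_sum)]:
    seconds = timeit.timeit(lambda: fn(dat_big, idx_big), number=10)
    print(f"{name}: {seconds / 10 * 1e3:.3f} ms per call")
```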

Upvotes: 0
