Reputation: 1596
I want to find the sum of the elements of one numpy array at the indexes where the same value appears in another numpy array.
It is best demonstrated with the following example:
import numpy as np
A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])
I prefer the output to be a dictionary, such that
d['a-1'] = 1.21 + 10.0 = 11.21
d['b-1'] = 2.34 + 1.2 + 0.9 = 4.44
d['c-2'] = 2.8 + 8.4 = 11.2
The result is the sum of the elements of the b array at the indexes where the same value appears in the A array. Is there an efficient way to do this? My arrays are large (on the order of millions of elements).
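For reference, a straightforward pure-Python baseline (not from any of the answers below, just a sketch of the desired behavior) accumulates into a defaultdict in one pass; the vectorized answers below aim to beat this on large arrays:

```python
import numpy as np
from collections import defaultdict

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# Accumulate each b value under the label at the same position in A.
d = defaultdict(float)
for key, val in zip(A, b):
    d[key] += val
```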
Upvotes: 3
Views: 1006
Reputation: 10759
The numpy_indexed package (disclaimer: I am its author) contains functionality to perform these types of operations in an efficient and elegant manner:
import numpy_indexed as npi
k, v = npi.group_by(A).sum(b)
d = dict(zip(k, v))
I feel that pandas is quite clunky with its grouping syntax, and that it shouldn't be necessary to reorganize your data into a new data structure to perform such an elementary operation.
Upvotes: 0
Reputation: 221514
Approach #1
We can use a combination of np.unique and np.bincount -
In [48]: unq, ids = np.unique(A, return_inverse=True)
In [49]: dict(zip(unq, np.bincount(ids, b)))
Out[49]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
So, np.unique gives us a unique integer mapping for each of the strings in A, which is then fed to np.bincount, which uses those integers as bins for bin-based weighted summation, with the weights taken from b.
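Putting the two In-prompts above into a self-contained script (same data as in the question):

```python
import numpy as np

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# return_inverse gives, for each element of A, the index of its unique label.
unq, ids = np.unique(A, return_inverse=True)

# bincount treats ids as bin numbers and sums the weights b per bin.
d = dict(zip(unq, np.bincount(ids, weights=b)))
```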
Approach #2 (Specific case)
Assuming that the strings in A are always 3 characters long, a faster way would be to convert those strings to numerals and then use those as the input to np.unique. The idea is that np.unique works faster with numerals than with strings.
Hence, the implementation would be -
In [141]: n = A.view(np.uint8).reshape(-1,3).dot(256**np.arange(3))
In [142]: unq, st, ids = np.unique(n, return_index=1, return_inverse=1)
In [143]: dict(zip(A[st], np.bincount(ids, b)))
Out[143]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
The magical part is that the uint8 view stays a view after reshaping (no copy is made), and as such should be pretty efficient:
In [150]: np.shares_memory(A,A.view(np.uint8).reshape(-1,3))
Out[150]: True
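One caveat when reproducing these In-prompts: they assume a bytes string dtype such as 'S3' (the default for string literals on Python 2). On Python 3, np.array(['a-1', ...]) produces a unicode dtype with 4 bytes per character, so a conversion with astype('S3') is needed first. A self-contained sketch under that assumption:

```python
import numpy as np

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# Convert to fixed-width bytes so each label occupies exactly 3 bytes.
Ab = A.astype('S3')

# View the raw bytes and pack each 3-byte row into one integer (base 256).
n = Ab.view(np.uint8).reshape(-1, 3).dot(256 ** np.arange(3))

# Group on the integers; return_index recovers one original label per group.
unq, st, ids = np.unique(n, return_index=True, return_inverse=True)
d = dict(zip(A[st], np.bincount(ids, weights=b)))
```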
Or we could use the axis parameter of np.unique (functionality added in 1.13.0) -
In [160]: A2D = A.view(np.uint8).reshape(-1,3)
In [161]: unq, st, ids = np.unique(A2D, axis=0, return_index=1, return_inverse=1)
In [162]: dict(zip(A[st], np.bincount(ids, b)))
Out[162]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
Upvotes: 4
Reputation: 19123
An alternative approach, using pandas:
import pandas as pd
df = pd.DataFrame(data=[pd.Series(A),pd.Series(b)]).transpose()
res = df.groupby(0).sum()
gives
res
Out[62]:
1
0
a-1 11.21
b-1 4.44
c-2 11.20
You can get the dict you'd like to have like this:
res_dict = res[1].to_dict()
Which gives
Out[64]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
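A more compact route through the same pandas groupby machinery (not shown in the answer above, but using only documented API) is to group a Series of b directly by the label array A, which avoids building the intermediate DataFrame:

```python
import numpy as np
import pandas as pd

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# Series.groupby accepts an array of labels the same length as the Series.
res_dict = pd.Series(b).groupby(A).sum().to_dict()
```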
Upvotes: 2