Shew

Reputation: 1596

Efficiently sum elements of a numpy array corresponding to indices matched by another array

I want to find the sum of the elements of one numpy array grouped by the matching values in another numpy array.

It is best demonstrated with the following example.

import numpy as np

A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])

I prefer the output to be a dictionary, such that

d['a-1'] = 1.21 + 10.0 = 11.21
d['b-1'] = 2.34 + 1.2 + 0.9 = 4.44
d['c-2'] = 2.8 + 8.4 = 11.2

The result is the sum of the elements of the b array at the indexes where the same value appears in the A array. Is there an efficient way to do this? My arrays are large (on the order of millions of elements).
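For reference, the requested grouping can be sketched in pure Python (a naive baseline to define the expected output, not the efficient solution being asked for):

```python
import numpy as np

A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])

# Accumulate b[i] under the key A[i]; linear time, but with
# Python-level loop overhead on large arrays.
d = {}
for key, val in zip(A, b):
    d[key] = d.get(key, 0.0) + val
```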

Upvotes: 3

Views: 1006

Answers (3)

Eelco Hoogendoorn

Reputation: 10759

The numpy_indexed package (disclaimer: I am its author) contains functionality to perform these types of operations in an efficient and elegant manner:

import numpy_indexed as npi
k, v = npi.group_by(A).sum(b)
d = dict(zip(k, v))

I feel that pandas is quite clunky with its grouping syntax, and that it shouldn't be necessary to reorganize your data into a new data structure to perform such an elementary operation.

Upvotes: 0

Divakar

Reputation: 221514

Approach #1

We can use a combination of np.unique and np.bincount -

In [48]: unq, ids = np.unique(A, return_inverse=True)

In [49]: dict(zip(unq, np.bincount(ids, b)))
Out[49]: 
{'a-1': 11.210000000000001,
 'b-1': 4.4400000000000004,
 'c-2': 11.199999999999999}

So, np.unique gives us a unique integer label for each of the strings in A; those labels are then fed to np.bincount, which uses them as bins for a weighted summation, with the weights taken from b.
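The mechanism can be seen in isolation in a short self-contained sketch (array values taken from the question):

```python
import numpy as np

A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])

# return_inverse labels each element with its index into the sorted uniques.
unq, ids = np.unique(A, return_inverse=True)   # ids is [0,1,1,2,0,1,2]
# bincount sums the weights b that fall into each integer bin.
sums = np.bincount(ids, weights=b)
d = dict(zip(unq, sums))
```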

Approach #2 (Specific case)

Assuming that the strings in A are always 3 characters long, a faster way would be to convert those strings to numbers and then use those as the input to np.unique. The idea is that np.unique works faster with numbers than with strings. Note that the byte view below assumes a fixed-width bytestring array (dtype S3); on Python 3, where string arrays default to unicode (4 bytes per character), first convert with A.astype('S3').

Hence, the implementation would be -

In [141]: n = A.view(np.uint8).reshape(-1,3).dot(256**np.arange(3))

In [142]: unq, st, ids = np.unique(n, return_index=1, return_inverse=1)

In [143]: dict(zip(A[st], np.bincount(ids, b)))
Out[143]: 
{'a-1': 11.210000000000001,
 'b-1': 4.4400000000000004,
 'c-2': 11.199999999999999}

The nice part is that the view-and-reshape stays a view rather than making a copy, and as such should be pretty efficient:

In [150]: np.shares_memory(A,A.view(np.uint8).reshape(-1,3))
Out[150]: True
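Since a Python 3 string array is unicode rather than bytes, a sketch of the same idea adapted for that case might look like the following (the astype conversion does copy, so the shares_memory advantage is lost, but the cost is still small relative to the sort inside np.unique):

```python
import numpy as np

A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])

Ab = A.astype('S3')                      # fixed-width bytes, 3 bytes/element
# Pack each 3-byte string into a single integer (base-256 digits).
n = Ab.view(np.uint8).reshape(-1, 3).dot(256 ** np.arange(3))
unq, st, ids = np.unique(n, return_index=True, return_inverse=True)
# st maps each unique number back to a position in A, recovering the string key.
d = dict(zip(A[st], np.bincount(ids, weights=b)))
```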

Or we could use the axis parameter of np.unique (functionality added in 1.13.0) -

In [160]: A2D = A.view(np.uint8).reshape(-1,3)

In [161]: unq, st, ids = np.unique(A2D, axis=0, return_index=1, return_inverse=1)

In [162]: dict(zip(A[st], np.bincount(ids, b)))
Out[162]: 
{'a-1': 11.210000000000001,
 'b-1': 4.4400000000000004,
 'c-2': 11.199999999999999}

Upvotes: 4

GPhilo

Reputation: 19123

An alternative approach, using pandas:

import pandas as pd
df = pd.DataFrame(data=[pd.Series(A),pd.Series(b)]).transpose()
res = df.groupby(0).sum()

gives

res
Out[62]: 
         1
0         
a-1  11.21
b-1   4.44
c-2  11.20

You can get the dict you want like this:

res_dict = res[1].to_dict()

Which gives

Out[64]: 
{'a-1': 11.210000000000001,
 'b-1': 4.4400000000000004,
 'c-2': 11.199999999999999}
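A slightly more direct construction of the same pandas idea (a sketch) skips the intermediate DataFrame by grouping a Series directly on the label array:

```python
import numpy as np
import pandas as pd

A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])

# Series.groupby accepts an array of labels aligned with the values.
res_dict = pd.Series(b).groupby(A).sum().to_dict()
```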

Upvotes: 2
