Reputation: 1596
I want to find the sum of the elements of one numpy array at the indexes where the same value appears in another numpy array.
It is best demonstrated with the following example:
import numpy as np
A = np.array(['a-1','b-1','b-1','c-2','a-1','b-1','c-2'])
b = np.array([1.21,2.34,1.2,2.8,10.0,0.9,8.4])
I prefer the output to be a dictionary, such that
d['a-1'] = 1.21 + 10.0 = 11.21
d['b-1'] = 2.34 + 1.2 + 0.9 = 4.44
d['c-2'] = 2.8 + 8.4 = 11.2
The result is the sum of the elements of the b array at the indexes where the same value appears in the A array. Is there an efficient way to do this? My arrays are large (on the order of millions of elements).
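For reference, a straightforward pure-Python baseline (not from any of the answers below, just a sketch of the desired behavior) accumulates into a defaultdict in one pass; the vectorized answers below aim to beat this on large arrays:

```python
import numpy as np
from collections import defaultdict

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# Accumulate each b value under the label at the same position in A.
d = defaultdict(float)
for key, val in zip(A, b):
    d[key] += val
```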
Upvotes: 3
Views: 1006
Reputation: 10759
The numpy_indexed package (disclaimer: I am its author) contains functionality to perform these types of operations in an efficient and elegant manner:
import numpy_indexed as npi
k, v = npi.group_by(A).sum(b)
d = dict(zip(k, v))
I feel that pandas is quite clunky with its grouping syntax, and that it shouldn't be necessary to reorganize your data into a new data structure to perform such an elementary operation.
Upvotes: 0
Reputation: 221514
Approach #1
We can use a combination of np.unique and np.bincount -
In [48]: unq, ids = np.unique(A, return_inverse=True)
In [49]: dict(zip(unq, np.bincount(ids, b)))
Out[49]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
So, np.unique gives us a unique integer mapping for each of the strings in A, which is then fed to np.bincount, which uses those integers as bins for bin-based weighted summation, with the weights taken from b.
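Putting the two In-prompts above into a self-contained script (same data as in the question):

```python
import numpy as np

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# return_inverse gives, for each element of A, the index of its unique label.
unq, ids = np.unique(A, return_inverse=True)

# bincount treats ids as bin numbers and sums the weights b per bin.
d = dict(zip(unq, np.bincount(ids, weights=b)))
```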
Approach #2 (Specific case)
Assuming that the strings in A are always 3 characters long, a faster way would be to convert those strings to numerals and then use those as the input to np.unique. The idea is that np.unique works faster with numerals than with strings.
Hence, the implementation would be -
In [141]: n = A.view(np.uint8).reshape(-1,3).dot(256**np.arange(3))
In [142]: unq, st, ids = np.unique(n, return_index=1, return_inverse=1)
In [143]: dict(zip(A[st], np.bincount(ids, b)))
Out[143]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
The magical part is that the uint8 view stays a view after reshaping (no copy is made), and as such should be pretty efficient:
In [150]: np.shares_memory(A,A.view(np.uint8).reshape(-1,3))
Out[150]: True
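One caveat when reproducing these In-prompts: they assume a bytes string dtype such as 'S3' (the default for string literals on Python 2). On Python 3, np.array(['a-1', ...]) produces a unicode dtype with 4 bytes per character, so a conversion with astype('S3') is needed first. A self-contained sketch under that assumption:

```python
import numpy as np

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# Convert to fixed-width bytes so each label occupies exactly 3 bytes.
Ab = A.astype('S3')

# View the raw bytes and pack each 3-byte row into one integer (base 256).
n = Ab.view(np.uint8).reshape(-1, 3).dot(256 ** np.arange(3))

# Group on the integers; return_index recovers one original label per group.
unq, st, ids = np.unique(n, return_index=True, return_inverse=True)
d = dict(zip(A[st], np.bincount(ids, weights=b)))
```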
Or we could use the axis parameter of np.unique (functionality added in 1.13.0) -
In [160]: A2D = A.view(np.uint8).reshape(-1,3)
In [161]: unq, st, ids = np.unique(A2D, axis=0, return_index=1, return_inverse=1)
In [162]: dict(zip(A[st], np.bincount(ids, b)))
Out[162]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
Upvotes: 4
Reputation: 19123
An alternative approach, using pandas:
import pandas as pd
df = pd.DataFrame(data=[pd.Series(A),pd.Series(b)]).transpose()
res = df.groupby(0).sum()
gives
res
Out[62]:
1
0
a-1 11.21
b-1 4.44
c-2 11.20
You can get the dict you'd like to have like this:
res_dict = res[1].to_dict()
Which gives
Out[64]:
{'a-1': 11.210000000000001,
'b-1': 4.4400000000000004,
'c-2': 11.199999999999999}
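A more compact route through the same pandas groupby machinery (not shown in the answer above, but using only documented API) is to group a Series of b directly by the label array A, which avoids building the intermediate DataFrame:

```python
import numpy as np
import pandas as pd

A = np.array(['a-1', 'b-1', 'b-1', 'c-2', 'a-1', 'b-1', 'c-2'])
b = np.array([1.21, 2.34, 1.2, 2.8, 10.0, 0.9, 8.4])

# Series.groupby accepts an array of labels the same length as the Series.
res_dict = pd.Series(b).groupby(A).sum().to_dict()
```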
Upvotes: 2