Reputation: 1682
Here is what my data looks like:
a = np.array([[1,2],[2,1],[7,1],[3,2]])
I want to compute a sum for each number in the second column. In the example, there are two possible values in the second column: 1 and 2.
I want to sum all values in the first column that share the same value in the second column. Is there a built-in numpy function for this?
For example, the sum for each 1 in the second column would be: 2 + 7 = 9
Upvotes: 1
Views: 558
Reputation: 3936
As far as I know, there is no function to do this in numpy, but this can easily be done with pandas.DataFrame.groupby.
In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]:
1
1 9
2 4
Name: 0, dtype: int64
Of course, you could do the same thing with itertools.groupby:
In [1]: import numpy as np
...: from itertools import groupby
...: from operator import itemgetter
...:
In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [4]: sa = sorted(a.tolist(), key=itemgetter(1))
In [5]: grouper = groupby(sa, key=itemgetter(1))
In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}
In [7]: sums
Out[7]: {1: 9, 2: 4}
Upvotes: 1
Reputation: 7756
You might want to have a look at this library: https://github.com/ml31415/accumarray . It is a clone of MATLAB's accumarray for numpy.
a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])
The first 0 means that there were no rows with 0 in the index column.
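For comparison, a similar accumulation can be sketched in plain numpy without the extra library. This is not the accumarray library's implementation, just a minimal sketch that swaps in np.unique with return_inverse and np.add.at; the names keys, inverse and sums are chosen here for illustration.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])

# Map each key in column 1 to its position in the sorted unique keys.
keys, inverse = np.unique(a[:, 1], return_inverse=True)

# Unbuffered in-place add: repeated positions accumulate instead of overwriting.
sums = np.zeros(len(keys), dtype=a.dtype)
np.add.at(sums, inverse, a[:, 0])

print(keys)  # [1 2]
print(sums)  # [9 4]
```

Because the keys are remapped to 0..k-1 first, the result has no placeholder entry for missing key values such as 0.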
Upvotes: 2
Reputation: 10219
Contents of play.py:
import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of tuple contains value of col 2
    # Second coordinate contains the sum of entries from col 1
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    # Generator: the sums are only computed when it is iterated
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

def compute_sum3(a):
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    # Stacks the index arrays, so every group must have the same number of rows
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum

def main():
    a = np.array([[1,2],[2,1],[7,1],[3,2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)

if __name__ == '__main__':
    main()
Output:
In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9. 4.]
compute_sum2
[9.0, 4.0]
In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop
In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop
In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop
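Note that compute_sum2() shows the lowest time, but it returns a generator, so the timing above only covers np.unique and building the generator; the sums themselves are computed when the result is iterated.
If your matrix is huge, I still suggest using this implementation, as it uses generator expressions instead of list comprehensions, which is more memory efficient.
Similarly, same_sum in compute_sum1() can be converted to a generator expression by replacing [] with ().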
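One thing to keep in mind when reading these timings: a generator does no work until it is iterated, and it can only be consumed once. A small sketch of that behaviour follows; lazy_sums is a hypothetical helper written for this illustration (it uses a boolean mask rather than np.argwhere), not part of play.py.

```python
import numpy as np

def lazy_sums(a):
    """Hypothetical helper mirroring compute_sum2: returns a generator."""
    unique = np.unique(a[:, 1])
    # Nothing inside this expression runs until the generator is iterated.
    return (float(a[a[:, 1] == u, 0].sum()) for u in unique)

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]], dtype=float)

gen = lazy_sums(a)     # cheap: no per-group sum computed yet
result = list(gen)     # the sums are evaluated here
leftover = list(gen)   # the generator is already exhausted

print(result)    # [9.0, 4.0]
print(leftover)  # []
```

To compare running times fairly, time the consumption as well, for example %timeit list(play.compute_sum2(a)).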
Upvotes: 2
Reputation: 17871
A short but slightly dodgy way is to use the numpy function bincount:
np.bincount(a[:,1], weights=a[:,0])
What it does is count the number of occurrences of 0, 1, 2, etc. in the array (in this case a[:,1], which is the list of your category numbers). The weights argument makes each occurrence contribute its weight instead of 1; here the weight is the first value in each row, so the result is effectively a sum per category.
What it returns is this:
array([ 0., 9., 4.])
where 0 is the sum of first elements where the second element is 0, etc. So it will only work if the second numbers, by which you group, are non-negative integers.
If they are not consecutive integers starting from 0, you can select those you need by doing:
np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]
This will return
array([9., 4.])
which is an array of sums, sorted by the second element (because unique returns a sorted array).
If your second elements are not integers, first off you are in some kind of trouble because of floating-point arithmetic (elements which you think are equal could be different in reality). However, if you are sure it is fine, you can sort them and assign integers to them (using scipy's rankdata function, imported here as rd, for example):
from scipy.stats import rankdata as rd
ind = rd(a[:,1], method='dense').astype(int) - 1  # ranking begins from 1, we need from 0
sums = np.bincount(ind, weights=a[:,0])
This will return array([9., 4.]), in order sorted by your second element. You can zip them to pair the sums with the appropriate elements:
zip(np.unique(a[:,1]), sums)
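If you would rather avoid the scipy dependency, the same integer labels can be obtained from np.unique itself. The following is a minimal sketch, assuming exact equality is an acceptable grouping criterion for your keys; the non-integer data here is made up for illustration.

```python
import numpy as np

# Illustrative data with non-integer keys in the second column.
a = np.array([[1.0, 2.5], [2.0, 0.5], [7.0, 0.5], [3.0, 2.5]])

# inverse holds, for each row, the index of its key in the sorted unique keys.
keys, inverse = np.unique(a[:, 1], return_inverse=True)
sums = np.bincount(inverse, weights=a[:, 0])

pairs = list(zip(keys.tolist(), sums.tolist()))
print(pairs)  # [(0.5, 9.0), (2.5, 4.0)]
```

This also covers non-consecutive or negative integer keys, since bincount only ever sees the labels 0..k-1.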
Upvotes: 2
Reputation: 17871
The most straightforward way I see is through a list comprehension:
s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]
However, if the second number in your lists represents some kind of a category, I suggest you convert your data into a dictionary.
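One possible shape for such a dictionary is sketched below, assuming the second number is the category and the first is the value to accumulate; the names rows and sums are illustrative.

```python
from collections import defaultdict

rows = [[1, 2], [2, 1], [7, 1], [3, 2]]

# Category -> running sum; missing categories start at 0.
sums = defaultdict(int)
for value, category in rows:
    sums[category] += value

print(dict(sums))  # {2: 4, 1: 9}
```

This makes a single pass over the data, whereas the list comprehension above rescans every row once per category.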
Upvotes: 1