Reputation: 1682

Manipulating data from python Numpy array: Using values from one column to sum over adjacent value

Here is what my data looks like:

a = np.array([[1,2],[2,1],[7,1],[3,2]])

I want to compute a sum for each distinct value in the second column. In the example, there are two possible values in the second column: 1 and 2.

I want to sum all values in the first column that share the same value in the second column. Is there a built-in numpy function for this?

For example a sum for each 1 in the second column would be: 2 + 7 = 9

Upvotes: 1

Answers (5)

JaminSore

Reputation: 3936

As far as I know, there is no function to do this in numpy, but this can easily be done with pandas.DataFrame.groupby.

In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]: 
1
1    9
2    4
Name: 0, dtype: int64

Of course, you could do the same thing with itertools.groupby:

In [1]: import numpy as np
   ...: from itertools import groupby
   ...: from operator import itemgetter
   ...: 

In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])

In [4]: sa = sorted(a.tolist(), key=itemgetter(1))

In [5]: grouper = groupby(sa, key=itemgetter(1))

In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}

In [7]: sums
Out[7]: {1: 9, 2: 4}

Upvotes: 1

Michael

Reputation: 7756

You might want to have a look at this library: https://github.com/ml31415/accumarray . It's a clone of MATLAB's accumarray for NumPy.

a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])

The leading 0 means that there were no rows with 0 in the index column.
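For readers who cannot install the library, the same accumulation can be approximated in plain NumPy. The following is only a sketch, assuming non-negative integer indices; accum_sketch is a hypothetical helper name, not part of the accumarray library.

```python
import numpy as np

def accum_sketch(idx, vals):
    """Hypothetical helper: sum vals into slots selected by non-negative integer idx."""
    out = np.zeros(idx.max() + 1, dtype=vals.dtype)
    # np.add.at performs an unbuffered add, so repeated indices accumulate
    np.add.at(out, idx, vals)
    return out

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
sums = accum_sketch(a[:, 1], a[:, 0])
print(sums)  # [0 9 4]
```

As with the library call, slot 0 stays at 0 because no row has 0 in the index column.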

Upvotes: 2

lightalchemist

Reputation: 10219

Contents of play.py

import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of tuple contains value of col 2
    # Second coordinate contains the sum of entries from col 1
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

def compute_sum3(a):
    # NOTE: stacking the index arrays only works when every group has the same number of rows
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum

def main():
    a = np.array([[1,2],[2,1],[7,1],[3,2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)


if __name__ == '__main__':
    main()

Output:

In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9.  4.]
compute_sum2
[9.0, 4.0]

In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop

In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop

In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop

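Note that compute_sum2() shows the lowest time, but it returns a generator, so %timeit only measures building the generator; the sums themselves are computed when it is consumed (as in main()). If your matrix is huge, I still suggest this implementation, because a generator expression produces one result at a time instead of building a whole list, which is more memory efficient. Similarly, same_sum in compute_sum1() can be turned into a generator expression by replacing [] with ().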
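Since compute_sum2() returns a generator, the work happens only when the result is consumed. Here is a minimal sketch of that behaviour, using a boolean mask instead of np.argwhere so that groups of different sizes also work; group_sums_lazy is a hypothetical name, not one of the functions in play.py.

```python
import numpy as np

def group_sums_lazy(a):
    """Hypothetical helper: lazily yield (key, sum) pairs, one group at a time."""
    keys = np.unique(a[:, 1])
    # Generator expression: no sum is computed until the result is iterated
    return ((float(k), float(a[a[:, 1] == k, 0].sum())) for k in keys)

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2], [5, 1]], dtype=float)
lazy = group_sums_lazy(a)  # cheap: only the generator object is created
result = list(lazy)        # the sums are computed here
print(result)  # [(1.0, 14.0), (2.0, 4.0)]
```

To compare timings fairly, the generator has to be consumed inside the timed statement, for example by timing list(group_sums_lazy(a)).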

Upvotes: 2

sashkello

Reputation: 17871

A short but slightly hacky way is to use NumPy's bincount function:

np.bincount(a[:,1], weights=a[:,0])

bincount counts the number of occurrences of 0, 1, 2, etc. in the array (in this case a[:,1], which holds your category numbers). With weights, each occurrence contributes its weight instead of 1. Here the weight is the corresponding value from the first column, so the result is a sum per category.

What it returns is this:

array([ 0.,  9.,  4.])

where 0 is the sum of the first elements whose second element is 0, and so on. So it only works if the values you group by are non-negative integers.
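To make the effect of weights concrete, here is a minimal sketch of the same computation written as an explicit loop. The names keys, values and totals are illustrative only.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
keys, values = a[:, 1], a[:, 0]

# One slot per possible key 0..max(key); each row adds its value to its key's slot
totals = np.zeros(keys.max() + 1)
for k, v in zip(keys, values):
    totals[k] += v

print(totals.tolist())  # [0.0, 9.0, 4.0]
```

The loop gives the same array as np.bincount(keys, weights=values); bincount simply does the accumulation in compiled code.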

If they are not consecutive integers starting from 0, you can select those you need by doing:

np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]

This will return

array([9.,  4.])

which is an array of sums, sorted by the second element (because unique returns a sorted list).

If your second elements are not integers, first off you are in some kind of trouble because of floating point arithmetic (elements which you think are equal could differ in reality). However, if you are sure it is fine, you can sort them and assign integers to them (using scipy.stats.rankdata, for example):

from scipy.stats import rankdata as rd
ind = rd(a[:,1], method='dense').astype(int) - 1 # ranking begins from 1, we need from 0
sums = np.bincount(ind, weights=a[:,0])

This will return array([9., 4.]), in order sorted by your second element. You can zip them to pair sums with appropriate elements:

list(zip(np.unique(a[:,1]), sums))  # list() is needed on Python 3, where zip is lazy
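If you would rather avoid the SciPy dependency, np.unique with return_inverse=True should give the same dense integer labels. This is a sketch under that assumption; the sample array with float and negative keys is made up for illustration.

```python
import numpy as np

# Made-up data: the second column holds non-integer and negative keys
a = np.array([[1.0, 2.5], [2.0, -1.0], [7.0, -1.0], [3.0, 2.5]])

# keys: sorted distinct values; inverse: position of each row's key within keys
keys, inverse = np.unique(a[:, 1], return_inverse=True)
sums = np.bincount(inverse, weights=a[:, 0])

pairs = dict(zip(keys.tolist(), sums.tolist()))
print(pairs)  # {-1.0: 9.0, 2.5: 4.0}
```

The floating point caveat still applies: keys are grouped by exact equality.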

Upvotes: 2

sashkello

Reputation: 17871

The most straightforward way I see is through a list comprehension:

s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]

However, if the second number in your lists represents some kind of a category, I suggest you convert your data into a dictionary.
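One possible shape for such a dictionary, as a sketch only: keys are the categories from the second column and values are the running sums of the first column. The name totals is illustrative.

```python
from collections import defaultdict

import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])

totals = defaultdict(int)  # missing categories start at 0
for value, category in a.tolist():
    totals[category] += value

print(dict(totals))  # {2: 4, 1: 9}
```

This makes a single pass over the rows, whereas the list comprehension rescans the whole array once per category.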

Upvotes: 1
