Anom

Reputation: 107

Numpy array: group by one column, sum another

I have an array that looks like this:

 array([[ 0,  1,  2],
        [ 1,  1,  6],
        [ 2,  2, 10],
        [ 3,  2, 14]])

I want to sum the values of the third column that have the same value in the second column, so the result is something like:

 array([[ 0,  1,  8],
        [ 1,  2, 24]])

I started coding this but I'm stuck with this sum:

import numpy as np
import sys

inFile = sys.argv[1]

with open(inFile, 'r') as t:
    f = np.genfromtxt(t, delimiter=None, names =["1","2","3"])

f.sort(order=["1","2"])
if value == previous.value:
   sum(f["3"])

Upvotes: 8

Views: 14125

Answers (7)

Mercury

Reputation: 4171

A very neat, pure numpy solution is possible using np.histogram:

import numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

c1 = np.unique(A[:, 1])
c0 = np.arange(c1.shape[0])
c2 = np.histogram(A[:, 1], weights=A[:, 2], bins=c1.shape[0])[0]

result = np.c_[c0, c1, c2]

>>> result
array([[ 0,  1,  8],
       [ 1,  2, 24]])

When a weights array (of the same shape as the input array) is passed to np.histogram, each element a[i] of the input array a contributes weights[i] to the count for its bin, instead of 1.

So, for example, we are counting the second column, and instead of counting 2 instances of 2, we get 10 instances of 2 + 14 instances of 2 = a count of 24 in 2's bin.
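
Note that equal-width bins line up with the keys here only because the keys 1 and 2 happen to be evenly spaced. As a hedged sketch of the same weighted-count idea for arbitrary keys, np.unique with return_inverse=True can assign each row a group number, and np.bincount can then sum the weights per group. The helper name group_sum and the second test array B are made up for this illustration:

```python
import numpy as np

def group_sum(A):
    """Sum column 2 of A per distinct value of column 1 (hypothetical helper)."""
    # keys: sorted distinct values; inverse: group number of every row
    keys, inverse = np.unique(A[:, 1], return_inverse=True)
    # bincount adds weights[i] to bin inverse[i]; it returns floats, so cast back
    sums = np.bincount(inverse, weights=A[:, 2]).astype(A.dtype)
    return np.c_[np.arange(keys.size), keys, sums]

A = np.array([[0, 1, 2],
              [1, 1, 6],
              [2, 2, 10],
              [3, 2, 14]])
result = group_sum(A)
print(result)

# Illustrative input with unevenly spaced, unsorted keys
B = np.array([[0, 1, 2],
              [1, 5, 3],
              [2, 7, 4],
              [3, 5, 1]])
print(group_sum(B))
```

Unlike the histogram version, this does not depend on how the key values are spaced, and it does not require the rows to be sorted.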

Upvotes: 5

Mad Physicist

Reputation: 114330

If your data is sorted by the second column, you can use something centered around np.add.reduceat for a pure numpy solution. A combination of np.nonzero (or np.where) applied to np.diff will give you the locations where the second column switches values. You can use those indices to do the sum-reduction. The other columns are pretty formulaic, so you can concatenate them back in fairly easily:

import numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])
# Find the split indices
i = np.nonzero(np.diff(A[:, 1]))[0] + 1
i = np.insert(i, 0, 0)
# Compute the result columns
c0 = np.arange(i.size)
c1 = A[i, 1]
c2 = np.add.reduceat(A[:, 2], i)
# Concatenate the columns
result = np.c_[c0, c1, c2]


Notice the +1 in the indices. That is because you always want the location after the switch, not before, given how reduceat works. The insertion of zero as the first index could also be accomplished with np.r_, np.concatenate, etc.
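
If the rows are not already sorted by the second column, one possible adaptation (a sketch, not part of the approach as written above) is to sort first with a stable np.argsort and then apply the same reduceat steps. The shuffled array B is invented for illustration:

```python
import numpy as np

# Hypothetical input: the same rows as A, shuffled so equal keys are not adjacent
B = np.array([[2, 2, 10],
              [0, 1, 2],
              [3, 2, 14],
              [1, 1, 6]])

# Stable sort by the second column so that equal keys become adjacent
order = np.argsort(B[:, 1], kind="stable")
S = B[order]

# Start index of each run of equal keys: 0, plus every position right after a change
starts = np.r_[0, np.nonzero(np.diff(S[:, 1]))[0] + 1]

# Same column construction as before, applied to the sorted rows
result = np.c_[np.arange(starts.size), S[starts, 1], np.add.reduceat(S[:, 2], starts)]
print(result)
```

The sort costs O(n log n), so this is only worth doing when the input order cannot be guaranteed.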

That being said, I still think you are looking for the pandas version in @jpp's answer.

Upvotes: 7

zipa

Reputation: 27869

To get the exact output, use pandas:

import pandas as pd
import numpy as np

a = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(a)
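# Select column 2 before summing so that column 0 is not summed as well;
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
df.groupby(1)[2].sum().reset_index().reset_index().to_numpy()
# [[ 0  1  8]
#  [ 1  2 24]]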

Upvotes: 0

jpp

Reputation: 164693

You can use pandas to vectorize your algorithm:

import pandas as pd, numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(A)\
       .groupby(1, as_index=False)\
       .sum()\
       .reset_index()

res = df[['index', 1, 2]].values

Result

array([[ 0,  1,  8],
       [ 1,  2, 24]], dtype=int64)

Upvotes: 5

IMCoins

Reputation: 3306

Here is my solution using only numpy arrays...

import numpy as np
arr = np.array([[ 0,  1,  2], [ 1,  1,  6], [ 2,  2, 10], [ 3,  2, 14]])

lst = []
# compt numbers the output rows; index runs over the values 1..max of the 2nd column
for compt, index in enumerate(range(1, max(arr[:, 1]) + 1)):
    lst.append([compt, index, np.sum(arr[arr[:, 1] == index][:, 2])])
lst = np.array(lst)
print(lst)
# lst, outputs...
# [[ 0  1  8]
#  [ 1  2 24]]

The tricky part is np.sum(arr[arr[:, 1] == index][:, 2]), so let's break it down into parts.

  • arr[arr[:, 1] == index] means...

You have an array arr, and we ask numpy for the rows whose 2nd column (the column with index 1) matches the current value of the for loop. Here, the loop runs from 1 to the maximum value in that column. Printing only this expression inside the for loop gives:

# First iteration
[[0 1 2]
 [1 1 6]]
# Second iteration
[[ 2  2 10]
 [ 3  2 14]]

  • Adding [:, 2] to the expression selects the 3rd column (index 2) of the rows above. Printing arr[arr[:, 1] == index][:, 2] gives [2, 6] at the first iteration and [10, 14] at the second.

  • I just need to sum these values using np.sum(), and to format my output list accordingly. :)
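
The steps above can be traced one expression at a time. This is only an illustrative sketch for the first group (index = 1); the intermediate names mask, rows and values are not part of the code above:

```python
import numpy as np

arr = np.array([[0, 1, 2], [1, 1, 6], [2, 2, 10], [3, 2, 14]])
index = 1  # the first value taken by the loop variable

mask = arr[:, 1] == index   # boolean array: True where the 2nd column equals index
rows = arr[mask]            # keeps only the matching rows
values = rows[:, 2]         # 3rd column of those rows
total = np.sum(values)      # the sum for this group

print(mask)    # [ True  True False False]
print(values)  # [2 6]
print(total)   # 8
```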

Upvotes: 1

Hirabayashi Taro

Reputation: 943

You can also use a defaultdict and sum the values:

from collections import defaultdict

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

res = defaultdict(int)
for val in x:
    res[val[1]] += val[2]
print([[i, val, res[val]] for i, val in enumerate(res)])

Upvotes: 0

JahKnows

Reputation: 2706

Using a dictionary to store the values and then converting back to a list:

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

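y = {}
for val in x:
    if val[1] in y:
        y[val[1]][2] += val[2]
    else:
        # Build a new row so the in-place sum does not modify x;
        # len(y) is the running group number for the first column
        y[val[1]] = [len(y), val[1], val[2]]
print([y[val] for val in y])
# [[0, 1, 8], [1, 2, 24]]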

Upvotes: 0
