Botond

Reputation: 2802

NumPy array sum reduce

I have a numpy array with three columns of the form:

x1 y1 f1
x2 y2 f2
...
xn yn fn

The (x, y) pairs may repeat. I need another array in which each (x, y) pair appears exactly once, and the corresponding third column is the sum of all the f values that appeared next to that (x, y) in the original array.

For example, the array

1 2 4.0
1 1 5.0
1 2 3.0
0 1 9.0

would give

0 1 9.0
1 1 5.0
1 2 7.0

The order of rows is not relevant. What is the fastest way to do this in Python?

Thank you!

Upvotes: 3

Views: 3968

Answers (4)

Botond

Reputation: 2802

Thanks to @hpaulj, I finally found the simplest solution. If d contains the 3-column data:

import numpy as np

ind = d[:, 0:2].astype(int)                    # (x, y) index pairs from the first two columns
x = np.zeros((N, N))
np.add.at(x, (ind[:, 0], ind[:, 1]), d[:, 2])  # unbuffered add, so duplicate pairs accumulate

This solution assumes that the (x, y) indices in the first two columns are integers and smaller than N. This is what I need and should have mentioned in the post.

Edit: Note that the above solution produces an N x N matrix (dense storage, though mostly zeros) with the sum values at position (x, y) within the matrix.
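A complete, runnable sketch of this approach, assuming the example data from the question and N = 3, including the step back from the matrix to the 3-column form:

```python
import numpy as np

# Example data from the question: rows of (x, y, f)
d = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])
N = 3  # assumed upper bound on the integer indices

ind = d[:, 0:2].astype(int)                    # (x, y) index pairs
x = np.zeros((N, N))
np.add.at(x, (ind[:, 0], ind[:, 1]), d[:, 2])  # unbuffered add: duplicate pairs accumulate

# Recover the 3-column form from the nonzero entries
r, c = np.nonzero(x)
out = np.column_stack([r, c, x[r, c]])
# out is [[0, 1, 9], [1, 1, 5], [1, 2, 7]]
```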

Upvotes: 1

hpaulj

Reputation: 231395

If you have scipy, the sparse module does this kind of addition - again for an array where the first two columns are integers, i.e. indexes.

import numpy as np
from scipy import sparse

# data is the f column; the first two columns give the (row, col) indexes
M = sparse.csr_matrix((d[:, 2], (d[:, 0].astype(int), d[:, 1].astype(int))))
M = M.tocoo()  # there may be a short cut to this csr/coo round trip
x = np.column_stack([M.row, M.col, M.data])  # needs testing

For convenience in constructing certain kinds of linear algebra matrices, the csr sparse array format sums values with duplicate indices. It's implemented in compiled code so should be fairly fast. But putting the data into M and taking it back out might slow it down.

(ps. I haven't tested this script since I'm writing this on a machine without scipy).
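A variant of the same idea that skips the csr/coo round trip, letting coo_matrix merge the duplicates itself via sum_duplicates; a sketch using the example data from the question:

```python
import numpy as np
from scipy import sparse

d = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])

# data is the f column; the first two columns supply the (row, col) indices
M = sparse.coo_matrix((d[:, 2], (d[:, 0].astype(int), d[:, 1].astype(int))))
M.sum_duplicates()  # merges entries sharing a (row, col) index, summing their data
x = np.column_stack([M.row, M.col, M.data])
```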

Upvotes: 0

dawg

Reputation: 103844

This is easily done in plain Python:

arr = np.array([[1,2,4.0],
                [1,1,5.0],
                [1,2,3.0],
                [0,1,9.0]])
d = {}
for x, y, z in arr:
    d.setdefault((x, y), 0)
    d[x, y] += z

>>> d
{(1.0, 2.0): 7.0, (0.0, 1.0): 9.0, (1.0, 1.0): 5.0}

Then translate back to numpy:

>>> np.array([[x,y,d[(x,y)]] for x,y in d.keys()]) 
array([[ 1.,  2.,  7.],
       [ 0.,  1.,  9.],
       [ 1.,  1.,  5.]])
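The setdefault call can be folded away with collections.defaultdict, which gives missing keys a starting value of 0.0 automatically; a minimal variant of the same loop:

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 2, 4.0],
                [1, 1, 5.0],
                [1, 2, 3.0],
                [0, 1, 9.0]])

d = defaultdict(float)  # missing keys start at 0.0
for x, y, z in arr:
    d[x, y] += z

out = np.array([[x, y, f] for (x, y), f in d.items()])
```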

Upvotes: 0

Divakar

Reputation: 221574

This would be one approach to solve it -

import numpy as np

# Input array
A = np.array([[1,2,4.0],
             [1,1,5.0],
             [1,2,3.0],
             [0,1,9.0]])

# Extract xy columns            
xy = A[:,0:2]

# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(xy.T)
sorted_xy =  xy[sorted_idx,:]

# Differentiation along rows for sorted array
df1 = np.diff(sorted_xy,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# OR df2 = np.append([True],np.logical_or(df1[:,0]!=0,df1[:,1]!=0),0)
# OR df2 = np.append([True],np.dot(df1!=0,[True,True]),0)

# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1

# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels

# Get unique indices
unq_idx  = sorted_idx[df2]

# Get counts and unique rows and setup output array
counts = np.bincount(labels, weights=A[:,2])
unq_rows = xy[unq_idx,:]
out = np.append(unq_rows,counts.ravel()[:,None],1)

Input & Output -

In [169]: A
Out[169]: 
array([[ 1.,  2.,  4.],
       [ 1.,  1.,  5.],
       [ 1.,  2.,  3.],
       [ 0.,  1.,  9.]])

In [170]: out
Out[170]: 
array([[ 0.,  1.,  9.],
       [ 1.,  1.,  5.],
       [ 1.,  2.,  7.]])
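On newer NumPy versions (1.13+), the lexsort-and-diff labeling above can be written more compactly with np.unique(..., axis=0, return_inverse=True), which computes the same group labels; a sketch on the same input:

```python
import numpy as np

A = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])

# Unique (x, y) rows plus a label mapping each input row to its group
unq_xy, inverse = np.unique(A[:, 0:2], axis=0, return_inverse=True)
sums = np.bincount(inverse.ravel(), weights=A[:, 2])  # sum f within each group
out = np.column_stack([unq_xy, sums])
# out is [[0, 1, 9], [1, 1, 5], [1, 2, 7]]
```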

Upvotes: 2
