Salvia LU
Salvia LU

Reputation: 41

python-numpy aggregate values in array by group

Assuming there is a 2D-array and it is divided into several sub-regions as regions. there is another array filled with values. I would like to aggregate the values by sub-regions. The following code is my solution.

But, when the number of sub-regions is very large, the iteration costs much time. I would like to ask if there is any way to accelerate the programm? I suppose maybe numpy could do this, but I don't know how to do.

import numpy as np

regions = np.array([[0,0,0,1],
                    [0,0,1,1],
                    [1,1,1,2],
                    [2,2,2,2]], dtype=np.int32)
value_array = np.array([[9,5,8,4],
                        [6,4,8,5],
                        [4,5,9,7],
                        [4,7,3,0]], dtype=np.float32)

aggre_array = np.zeros_like(value_array)
for r in range(regions.max()+1):
    region = regions==r
    aggre_array[region] = value_array[region].mean()
print(aggre_array)
'''output
[[6.4       6.4       6.4       5.8333335]
 [6.4       6.4       5.8333335 5.8333335]
 [5.8333335 5.8333335 5.8333335 4.2      ]
 [4.2       4.2       4.2       4.2      ]]
'''

Upvotes: 4

Views: 401

Answers (1)

mathfux
mathfux

Reputation: 5949

In such kind of grouping you need to work with flattened variants of arrays sorted by indices that sorts first array, like:

regions_sort = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
value_array_sort = [9, 5, 8, 6, 4, 4, 8, 5, 4, 5, 9, 7, 4, 7, 3, 0]

The latter part is to find indices that separates individual groups and apply it for further calcutions of groups sums, counts and means:

marker_idx = [5, 11]
group_counts = [5, 6, 5]
group_sums = [32, 35, 21]
group_means = [6.4, 5.83333, 4.2]

Finally, repeat values so that it fits value_array, rearrange them in inverse order and reshape it to initial shape.

sorter = np.argsort(regions.ravel())
_, inverse_sorter = np.unique(sorter, return_index=True) #could be optimised...
regions_sort = regions.ravel()[sorter]
value_array_sort = value_array.ravel()[sorter]

marker_idx = np.flatnonzero(np.diff(regions_sort))+1
reduceat_idx = np.r_[0, marker_idx]
group_counts = np.diff(marker_idx, prepend=0, append=regions.size) #could also use np.bincount...
group_sums = np.add.reduceat(regions_sort, reduceat_idx)
group_means = group_sums / group_counts

new_values = np.repeat(group_means, group_counts)
new_value_array = new_values[inverse_sorter].reshape(value_array.shape)

>>> new_value_array    
array([[6.4       , 6.4       , 6.4       , 5.83333333],
       [6.4       , 6.4       , 5.83333333, 5.83333333],
       [5.83333333, 5.83333333, 5.83333333, 4.2       ],
       [4.2       , 4.2       , 4.2       , 4.2       ]])

I've also found a way to do it numpy_indexed package designed for solving grouping problems in efficient ways:

import numpy_indexed as npi
groupby = npi.group_by(regions.ravel())
keys, values = groupby.mean(value_array.ravel())
>>> values[groupby.inverse].reshape(regions.shape)

array([[6.4       , 6.4       , 6.4       , 5.83333333],
       [6.4       , 6.4       , 5.83333333, 5.83333333],
       [5.83333333, 5.83333333, 5.83333333, 4.2       ],
       [4.2       , 4.2       , 4.2       , 4.2       ]])

Upvotes: 1

Related Questions