MrArsGravis

Reputation: 598

Calculate partitioned sum efficiently with CuPy or NumPy

I have a very long array* of length L (let's call it values) that I want to sum over, and a sorted 1D integer array of the same length L whose entries (from 0 to N-1) partition the original array – let's call this array labels.

What I'm currently doing is this (module being cupy or numpy):

result = module.empty(N)
for i in range(N):
    result[i] = values[labels == i].sum()

But this can't be the most efficient way of doing it (it should be possible to get rid of the for loop, but how?). Since labels is sorted, I could easily determine the break points and use those indices as start/stop points, but I don't see how this solves the for loop problem.
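For illustration, in plain NumPy the break points could feed np.add.reduceat (a sketch that assumes every label from 0 to N-1 occurs at least once, since reduceat produces no zero for an empty group) – but as far as I can tell, CuPy ufuncs don't expose reduceat:

```python
import numpy as np

N = 3
values = np.array([1., 2., 3., 4., 5.])
labels = np.array([0, 0, 1, 2, 2])  # sorted; every label occurs

# break points: the start index of each label's run
starts = np.searchsorted(labels, np.arange(N))
result = np.add.reduceat(values, starts)  # -> [3., 3., 9.]
```

np.bincount(labels, weights=values, minlength=N) would be another loop-free option, and it tolerates labels that never occur.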

Note that I would like to avoid creating an array of size NxL along the way, if possible, since L is very large.

I'm working in cupy, but any numpy solution is welcome too and could probably be ported. Within cupy, it seems this would be a case for a ReductionKernel, but I don't quite see how to do it.


* in my case, values is 1D, but I assume the solution wouldn't depend on this

Upvotes: 1

Views: 723

Answers (2)

amzon-ex

Reputation: 1744

Here is a solution which, instead of an N x L array, uses an N x <max partition size> array (which should not be large if the partition sizes are not too disparate):

  1. Reshape the array into a 2-D array with one partition per row. Since each row is as long as the largest partition, pad missing values with zeros (which doesn't affect any sum). This uses @Divakar's solution given here.
def jagged_to_regular(a, parts):
    # parts holds the cumulative end index of each partition
    lens = np.ediff1d(parts, to_begin=parts[0])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a
    return out

# labels maps each element to its partition, so convert it to
# cumulative break points before calling
parts = np.searchsorted(labels, np.arange(1, N + 1))
parts_stack = jagged_to_regular(values, parts)
  2. Sum along axis 1:
result = np.sum(parts_stack, axis = 1)

In case you'd like a CuPy implementation, there's no direct CuPy alternative to numpy.ediff1d in jagged_to_regular. In that case, you can substitute the statement with numpy.diff like so:

lens = np.insert(np.diff(parts), 0, parts[0])

and then continue to use CuPy as a drop-in replacement for numpy.
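As a tiny self-contained check of the two steps above (reusing jagged_to_regular from the answer, with made-up data):

```python
import numpy as np

N = 3
values = np.array([1., 2., 3., 4., 5., 6.])
labels = np.array([0, 0, 0, 1, 2, 2])  # sorted labels

def jagged_to_regular(a, parts):
    # parts holds the cumulative end index of each partition
    lens = np.ediff1d(parts, to_begin=parts[0])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a
    return out

# cumulative break points derived from the sorted labels: [3, 4, 6]
parts = np.searchsorted(labels, np.arange(1, N + 1))
parts_stack = jagged_to_regular(values, parts)  # shape (3, 3), zero-padded
result = parts_stack.sum(axis=1)                # -> [6., 4., 11.]
```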

Upvotes: 1

Nick Becker

Reputation: 4224

You are describing a groupby sum aggregation. You could write a CuPy RawKernel for this, but it would be much easier to use the existing groupby aggregations implemented in cuDF, the GPU dataframe library. They can interoperate without requiring you to copy the data. If you call .values on the resulting cuDF Series, it will give you a CuPy array.

If you went back to the CPU, you could do the same thing with pandas.

import cupy as cp
import pandas as pd

N = 100
values = cp.random.randint(0, N, 1000)
labels = cp.sort(cp.random.randint(0, N, 1000))

result = cp.empty(N)
for i in range(N):
    result[i] = values[labels == i].sum()
    
result[:5]
array([547., 454., 402., 601., 668.])
import cudf

df = cudf.DataFrame({"values": values, "labels": labels})
df.groupby(["labels"])["values"].sum().values[:5]
array([547, 454, 402, 601, 668])
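For the pandas route mentioned above, a minimal CPU sketch with made-up data (the reindex guards against labels that never occur, so the result always has length N):

```python
import numpy as np
import pandas as pd

N = 3
values = np.array([1., 2., 3., 4., 5.])
labels = np.array([0, 0, 2, 2, 2])  # label 1 never occurs

df = pd.DataFrame({"values": values, "labels": labels})
result = (
    df.groupby("labels")["values"]
      .sum()
      .reindex(range(N), fill_value=0)  # keep empty groups as 0
      .to_numpy()
)
# result -> [3., 0., 12.]
```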

Upvotes: 2
