kiriko09

Reputation: 13

Perform numpy.sum (or scipy.integrate.simps()) on a large split array efficiently

Let's consider a very large numpy array a of shape (M, N), where M is typically 1 or 100 and N is 10-100,000,000.

We have an array of indices that splits it into many (K = 1,000,000) sub-arrays along axis=1.

We want to efficiently perform an operation like integration along axis=1 (np.sum being the simplest case) on each sub-array and return an (M, K) array.

An elegant and efficient solution was proposed by @Divakar in question 41920367, "How to split numpy array and perform certain actions on split arrays [Python]", but my understanding is that it only applies to cases where all sub-arrays have the same shape, which allows for reshaping.

But in our case the sub-arrays don't have the same shape, which so far has forced me to loop over the indices... please take me out of my misery...

Example

import numpy as np

a = np.random.random((10, 100000000))
ind = np.sort(np.random.randint(10, 9000000, 1000000))

The sizes of the sub-arrays are not homogeneous:

sizes = np.diff(ind)
print(sizes.min(), sizes.max())
2 8732

So far, the best I found is:

output = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)

Possible feature request for numpy and scipy:

If looping is really unavoidable, at least having it done in C inside the numpy and scipy.integrate.simps (or romb) functions would probably speed up the computation. Something like

output = np.sum(a, axis=1, split_ind=ind)
output = scipy.integrate.simps(a, x=x, axis=1, split_ind=ind)
output = scipy.integrate.romb(a, x=x, axis=1, split_ind=ind)

would be very welcome! (where x itself could be splittable, or not)
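For what it's worth, I imagine the trapezoid variant at least could be emulated today with one vectorized cumulative pass plus fancy indexing. An untested sketch with made-up sizes, assuming a single shared abscissa x (not part of the example above):

import numpy as np

# Sketch only: array names and sizes are illustrative.
M, N = 4, 1000000
a = np.random.random((M, N))
x = np.linspace(0.0, 1.0, N)                          # shared abscissa
ind = np.unique(np.random.randint(1, N - 1, 1000))    # split points, as for np.split

# Segment k covers columns start[k] .. stop[k] - 1 (np.split convention).
start = np.concatenate(([0], ind))
stop = np.concatenate((ind, [N]))

# Cumulative trapezoid integral along axis=1: ct[:, j] integrates a from x[0] to x[j].
dx = np.diff(x)
ct = np.concatenate(
    [np.zeros((M, 1)), np.cumsum(0.5 * (a[:, 1:] + a[:, :-1]) * dx, axis=1)],
    axis=1,
)

# Per-segment trapezoid integral, all segments at once: shape (M, number of segments).
out = ct[:, stop - 1] - ct[:, start]

This should match applying np.trapz to each sub-array of np.split(a, ind, axis=1) with the corresponding slice of x, at the cost of one extra (M, N) temporary, but it doesn't cover the higher-order simps/romb cases.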

Side note:

While trying this example, I noticed that with these numbers there was almost always an element of sizes equal to 0 (sizes.min() is almost always zero). This looks peculiar to me: since we are picking 1,000,000 integers between 10 and 9,000,000, I expected the odds of the same number coming up twice (such that diff = 0) to be close to 0, yet they seem to be very close to 1. Would that be due to the algorithm behind np.random.randint?

Upvotes: 1

Views: 947

Answers (2)

kiriko09

Reputation: 13

Great!

Thanks! On my CentOS 6.9 VM I get the following results:

In [71]: a = np.random.random((10, 10000000))

In [72]: ind = np.unique(np.random.randint(10, 9000000, 100000))

In [73]: ind2 = np.append([0], ind)

In [74]: out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)

In [75]: out2 = np.add.reduceat(a, ind2, axis=1)

In [83]: np.allclose(out, out2)
Out[83]: True

In [84]: %timeit out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
2.7 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [85]: %timeit out2 = np.add.reduceat(a, ind2, axis=1)
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's a good 93% reduction in runtime (roughly a factor of 15 faster) compared to the list concatenation :-) Great!

Upvotes: 0

Daniel F

Reputation: 14399

What you want is np.add.reduceat

output = np.add.reduceat(a, ind, axis = 1)

output.shape
Out[]: (10, 1000000)

Universal functions (ufuncs) are a very powerful tool in numpy.
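For illustration (a sketch with made-up sizes, not the question's), the same reduceat pattern works with any binary ufunc, and prepending 0 to the indices makes the leading block its own segment, matching the grouping np.split would give:

import numpy as np

# Illustrative sizes only; the pattern is identical for the question's arrays.
a = np.random.random((10, 1000000))
ind = np.unique(np.random.randint(1, a.shape[1], 1000))  # np.unique avoids duplicate indices,
                                                         # for which reduceat would return a
                                                         # single element instead of a zero-length sum
ind0 = np.concatenate(([0], ind))                        # prepend 0 so the leading block is included

seg_sum = np.add.reduceat(a, ind0, axis=1)       # segment-wise sums, shape (10, len(ind0))
seg_max = np.maximum.reduceat(a, ind0, axis=1)   # segment-wise maxima
seg_min = np.minimum.reduceat(a, ind0, axis=1)   # segment-wise minima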

As for the repeated indices, that's simply the Birthday Problem cropping up.
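A quick back-of-the-envelope estimate (not exact, just the standard birthday-problem approximation) shows duplicates are essentially guaranteed with the question's numbers:

import numpy as np

# Birthday-problem estimate for K uniform draws from a range of size R:
# expected colliding pairs ~ K*(K-1) / (2*R),
# P(at least one collision) ~ 1 - exp(-K*(K-1) / (2*R)).
K = 1000000          # draws in np.random.randint(10, 9000000, 1000000)
R = 9000000 - 10     # size of the sampled range
expected_pairs = K * (K - 1) / (2 * R)
p_collision = 1 - np.exp(-expected_pairs)
print(expected_pairs)   # roughly 5.6e4 expected duplicate pairs
print(p_collision)      # 1.0 to machine precision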

Upvotes: 1
