Shawn

Reputation: 168

Any way to speed up numpy.cumsum()

I have a project where I calculate the cumulative sum of many large arrays. At roughly 2 seconds on one of my servers, this step is a major bottleneck. Is there any way to speed it up?
Note that these arrays represent temperature measurements, so they are floating-point values that can be either negative or positive. I have more cores available, but I'm already using parallel processing elsewhere, so using them here would not speed things up.

import numpy as np
import time

forcing = np.random.rand(380*1400*620).reshape((380,1400,620))

start = time.time()
forcing.cumsum(axis=0)
np_time = time.time() - start
print(np_time)
2.085033416748047

Upvotes: 1

Views: 3131

Answers (1)

ShadowRanger

Reputation: 155497

As mentioned by daniel451, numpy isn't parallelizing the cumsum operation, so you can parallelize it explicitly to gain at least a little performance.

For example, using multiprocessing.dummy (a thread-backed copy of the multiprocessing API), you could do:

import numpy as np
from multiprocessing.dummy import Pool
from itertools import repeat

forcing = np.random.rand(380*1400*620).reshape((380,1400,620))

# Make an output array of matching size, that can be populated piecemeal
# in each thread
forceres = np.zeros_like(forcing)

# Compute cumsum in parallel over second dimension
with Pool() as pool:
    # Use module function with np.rollaxis to avoid need to define
    # worker to do slicing
    pool.starmap(np.cumsum, zip(np.rollaxis(forcing, 1), repeat(0), repeat(None), np.rollaxis(forceres, 1)))
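Before relying on the threaded version, it may be worth confirming that it fills the output array with the same values as the plain call. The following is a minimal sketch on a much smaller array; the shape, the pool size of 4, and the names small, expected, result, and matches are illustrative choices, not part of the code above.

```python
import numpy as np
from itertools import repeat
from multiprocessing.dummy import Pool

# Small stand-in for the real data (shape chosen only to keep the check fast)
small = np.random.rand(38, 14, 62)
expected = small.cumsum(axis=0)

# Output array that the worker threads fill in, one slice each
result = np.zeros_like(small)
with Pool(4) as pool:
    # The zipped positional arguments map to np.cumsum(a, axis, dtype, out)
    pool.starmap(
        np.cumsum,
        zip(np.rollaxis(small, 1), repeat(0), repeat(None), np.rollaxis(result, 1)),
    )

# True when the threaded result agrees with the single-call result
matches = np.allclose(result, expected)
print(matches)
```

This works because np.rollaxis returns views, so each thread writes straight into its own slice of the shared output and no two threads touch the same memory.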

I tested this with ipython3's %time/%%time magics on an eight-core machine and found it cut wall-clock time by almost 70%, from 5.49 seconds to 1.73 seconds. Your machine is clearly faster, so if the same speedup holds there, I'd expect it to take roughly 0.66 seconds.
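The call above submits one task per index along the second dimension (1400 small tasks). A possible variation, which I have not benchmarked, is to hand each thread one contiguous block of that dimension instead. The sketch below uses concurrent.futures.ThreadPoolExecutor rather than multiprocessing.dummy; chunked_cumsum and its n_workers parameter are hypothetical names introduced only for illustration.

```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_cumsum(arr, n_workers=None):
    """Cumulative sum along axis 0, computed in blocks along axis 1 (hypothetical helper)."""
    n_workers = n_workers or os.cpu_count() or 1
    out = np.empty_like(arr)
    # Block boundaries along axis 1; one task per block
    bounds = np.linspace(0, arr.shape[1], n_workers + 1, dtype=int)

    def work(lo, hi):
        # Skip empty blocks, which occur when there are more workers than columns
        if hi > lo:
            # Write directly into the matching slice of the shared output
            np.cumsum(arr[:, lo:hi], axis=0, out=out[:, lo:hi])

    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # list() waits for all tasks and re-raises any worker exception
        list(executor.map(work, bounds[:-1], bounds[1:]))
    return out
```

Usage would be forceres = chunked_cumsum(forcing). Whether this beats the per-slice version depends on the machine, so time both before choosing.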

My comparison was:

>>> %%time
... forceres = np.zeros_like(forcing)
... with Pool() as pool:
...     pool.starmap(np.cumsum, zip(np.rollaxis(forcing, 1), repeat(0), repeat(None), np.rollaxis(forceres, 1)))
CPU times: user 10 s, sys: 213 ms, total: 10.2 s
Wall time: 1.73 s

vs.

>>> %time forcing.cumsum(axis=0); None  # ; None avoids output
CPU times: user 5.27 s, sys: 218 ms, total: 5.49 s
Wall time: 5.49 s
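Outside IPython, the same comparison can be approximated with time.perf_counter. This is only a sketch: the array is deliberately smaller than the original so it runs quickly, and time_call, serial, threaded, data, and out are hypothetical names. The timings it prints will differ from machine to machine.

```python
import time
import numpy as np
from itertools import repeat
from multiprocessing.dummy import Pool

def time_call(fn, repeats=3):
    """Return the best wall-clock time of fn() over a few runs (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Reduced size for a quick run; scale the shape up to match the real workload
data = np.random.rand(95, 350, 155)
out = np.zeros_like(data)

def serial():
    # Single-threaded baseline, as in the question
    data.cumsum(axis=0)

def threaded():
    # Thread-pool version, one slice of the second dimension per task
    with Pool() as pool:
        pool.starmap(
            np.cumsum,
            zip(np.rollaxis(data, 1), repeat(0), repeat(None), np.rollaxis(out, 1)),
        )

serial_time = time_call(serial)
threaded_time = time_call(threaded)
print(f"serial: {serial_time:.3f} s, threaded: {threaded_time:.3f} s")
```

Note that threaded() includes pool startup and shutdown in its timing, just as the %%time cell above does.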

Upvotes: 3
