Reputation: 168
I have a project where I calculate the cumulative sum of many large arrays. At 2 seconds on a server of mine this step is a large bottleneck. Is there any way I can speed it up?
Note that these arrays represent temperature measurements, so they are floating-point values that can be either negative or positive. While I have more cores available, I'm already using parallel processing elsewhere, so that would not speed things up in this case.
import numpy as np
import time
forcing = np.random.rand(380*1400*620).reshape((380,1400,620))
start = time.time()
forcing.cumsum(axis=0)
np_time = time.time() - start
print(np_time)
2.085033416748047
Upvotes: 1
Views: 3131
Reputation: 155497
As mentioned by daniel451, numpy isn't parallelizing the cumsum operation, so you can parallelize it explicitly to gain at least a little performance.
For example, using multiprocessing.dummy (a thread-backed copy of the multiprocessing API), you could do:
import numpy as np
from multiprocessing.dummy import Pool
from itertools import repeat
forcing = np.random.rand(380*1400*620).reshape((380,1400,620))
# Make an output array of matching size, that can be populated piecemeal
# in each thread
forceres = np.zeros_like(forcing)
# Compute cumsum in parallel over second dimension
with Pool() as pool:
    # Use module function with np.rollaxis to avoid need to define
    # worker to do slicing
    pool.starmap(np.cumsum, zip(np.rollaxis(forcing, 1), repeat(0), repeat(None), np.rollaxis(forceres, 1)))
I tested this with ipython3's %time/%%time magic on an eight-core machine, and found it reduced runtime vs. the original code by almost 70%, from 5.49 seconds to 1.73 seconds; your machine is clearly faster, so if the same speedup occurs on your machine I'd expect it to take ~0.66 seconds.
My comparison was:
>>> %%time
... forceres = np.zeros_like(forcing)
... with Pool() as pool:
...     pool.starmap(np.cumsum, zip(np.rollaxis(forcing, 1), repeat(0), repeat(None), np.rollaxis(forceres, 1)))
CPU times: user 10 s, sys: 213 ms, total: 10.2 s
Wall time: 1.73 s
vs.
>>> %time forcing.cumsum(axis=0); None # ; None avoids output
CPU times: user 5.27 s, sys: 218 ms, total: 5.49 s
Wall time: 5.49 s
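If dispatching one task per slice of the second dimension turns out to be too fine-grained, a coarser variant is to hand each thread a contiguous block of that dimension. The following is only a sketch of that idea, not something I timed above: the helper name threaded_cumsum and the n_chunks parameter are made up for illustration, and it assumes np.cumsum releases the GIL for float arrays (which the timings above suggest). It uses a small array so the check against the plain cumsum runs quickly.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def threaded_cumsum(arr, n_chunks=8):
    """Cumulative sum along axis 0, splitting axis 1 into blocks across threads.

    Hypothetical helper for illustration; n_chunks is an assumed tuning knob.
    """
    out = np.empty_like(arr)
    # Block boundaries along axis 1, e.g. [0, 35, 70, 105, 140] for 4 chunks of 140
    bounds = np.linspace(0, arr.shape[1], n_chunks + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=n_chunks) as executor:
        futures = [
            # Basic slicing yields views, so each task writes straight into `out`
            executor.submit(np.cumsum, arr[:, lo:hi], 0, None, out[:, lo:hi])
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread
    return out


# Small stand-in for the real (380, 1400, 620) array
rng = np.random.default_rng(0)
forcing = rng.standard_normal((38, 140, 62))

# Compare against the single-threaded result
print(np.allclose(threaded_cumsum(forcing, n_chunks=4), forcing.cumsum(axis=0)))
```

Whether this beats the per-slice starmap version depends on the machine and on the array layout, so time both on the real data before picking one.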
Upvotes: 3