I have some Python code that currently runs too slowly to be useful. Having done some speed tests, I found that the bulk of the time is spent on the mathematical operations that calculate corr (see code below).
import numpy as np
from multiprocessing import Pool
from contextlib import closing

def calc_cc(tlag):
    global phi_t, psi_t, T
    tstart = tlag - T/2
    x = np.arange(tstart, tstart + T, dtype=int)
    # bulk of time spent running line below calculating corr
    corr = np.sum(np.abs(np.cos(phi_t[tlag+x]-psi_t[x]))-np.abs(np.sin(phi_t[tlag+x]-psi_t[x])))
    return corr

def main():
    global phi_t, psi_t, T
    phi_t = #some large numpy array
    psi_t = #another numpy array
    tlag = #another numpy array
    with closing(Pool(processes=5)) as p:
        corr = p.map(calc_cc, tlag)
    p.terminate()

if __name__ == "__main__":
    main()
I have spent some time looking online for ways to optimise my Python code, and more specifically its mathematical operations, but most of the advice centres on using packages such as numpy instead of Python's built-in functions where possible (for example, using np.sum is vastly faster than sum()). Having now got to the point where I am using numpy arrays with numpy methods, I am not sure where else to find additional gains. The formula I am effectively trying to compute is:

$$\mathrm{corr}(t_{\mathrm{lag}}) = \frac{1}{N} \sum_{x = t_{\mathrm{lag}} - T/2}^{t_{\mathrm{lag}} + T/2 - 1} \Big( \big|\cos\big(\phi(t_{\mathrm{lag}} + x) - \psi(x)\big)\big| - \big|\sin\big(\phi(t_{\mathrm{lag}} + x) - \psi(x)\big)\big| \Big)$$
I am sure there must be a better way of doing this, so any guidance towards a better solution would be greatly appreciated :).
Upvotes: 1
Views: 1877
Firstly, you can improve your code slightly by computing phi_t[...] - psi_t[...] once for both trigonometric functions, as @hpaulj noted. However, the biggest issue I see is that you're using a multiprocessing pool for a comparatively light calculation; my guess is that a large part of your runtime comes from the overhead of the pool itself.
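For reference, here's a minimal sketch of that first change applied to the calc_cc from your question (still relying on the same globals):

import numpy as np

def calc_cc(tlag):
    global phi_t, psi_t, T
    tstart = tlag - T/2
    x = np.arange(tstart, tstart + T, dtype=int)
    diff = phi_t[tlag + x] - psi_t[x]   # phase difference computed once, reused below
    return np.sum(np.abs(np.cos(diff)) - np.abs(np.sin(diff)))

This only saves one subtraction and one pair of fancy-indexing passes per call, though; the bigger win comes from vectorizing over tlag, as below.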
If the problem fits in memory, you can compute your whole correlation function at once by making use of array broadcasting. By vectorizing your calculation, you'll maximize the speed-up you can get from numpy. The idea is to index your phi/psi arrays with 2d arrays: rows correspond to the original x indices, and columns correspond to the individual tlag shifts. The built-in numpy functions you're using naturally make use of broadcasting.
Here's what I mean, complete with some dummy data:
import numpy as np

def calc_cc_new(tlags):
    global phi_t, psi_t, T
    tstarts = tlags - T//2
    # xs has shape (T, len(tlags)): rows run over the window positions x,
    # columns over the individual tlag shifts
    xs = tstarts + np.arange(T)[:, None]
    dphipsi = phi_t[tlags + xs] - psi_t[xs]
    corr = np.sum(np.abs(np.cos(dphipsi)) - np.abs(np.sin(dphipsi)), axis=0)
    return corr

def main_new():
    global phi_t, psi_t, tlags
    return calc_cc_new(tlags)

N = 10000
T = 100
phi_t = np.random.rand(N)*2*np.pi
psi_t = np.random.rand(N)*2*np.pi
tlags = np.arange(T//2, 3*T)

if __name__ == "__main__":
    print(main_new())
I didn't check your multiprocessing version, but the serial version of your original took 8.1 ms while the vectorized version took 2.1 ms (whether or not I separated the two terms of the summation into separate calls to np.sum()). I checked with np.allclose that the array returned by my version agrees with the original (vectorization moves around the order of the arithmetic operations, so we can't expect exact agreement, only agreement to within machine precision). With the dummy example it didn't matter whether I defined the summation dimension to be the zeroth or the first, but depending on your input array shapes this might lead to some performance differences.
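Roughly the kind of timing and agreement check I mean, assuming your calc_cc returns its corr value and both functions are defined alongside the dummy data above:

import timeit
import numpy as np

serial = np.array([calc_cc(t) for t in tlags])   # original: one tlag at a time
vectorized = calc_cc_new(tlags)                  # broadcast: all tlags at once

print(np.allclose(serial, vectorized))           # True, up to machine precision
print(timeit.timeit(lambda: [calc_cc(t) for t in tlags], number=100))
print(timeit.timeit(lambda: calc_cc_new(tlags), number=100))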
Furthermore, the use of globals is usually discouraged, and a global namespace lookup is marginally slower than a local one (though I don't expect this to matter when your function is CPU-heavy). In any case, if you ditch the multiprocessing pool, you can simply extend the definition of your calc_cc function so that it takes all its inputs as parameters rather than globals.
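For example, a sketch of the vectorized version with explicit parameters instead of globals (same computation as above):

import numpy as np

def calc_cc_new(tlags, phi_t, psi_t, T):
    # identical to the version above, but every input is passed in explicitly
    tstarts = tlags - T//2
    xs = tstarts + np.arange(T)[:, None]          # shape (T, len(tlags))
    dphipsi = phi_t[tlags + xs] - psi_t[xs]
    return np.sum(np.abs(np.cos(dphipsi)) - np.abs(np.sin(dphipsi)), axis=0)

corr = calc_cc_new(tlags, phi_t, psi_t, T)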
Finally, don't forget that 1/N prefactor once you're finished.
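That is, something along the lines of the line below, assuming N is the normalisation length in your formula:

corr = calc_cc_new(tlags, phi_t, psi_t, T) / N   # apply the 1/N prefactor at the end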
Upvotes: 2