Reputation: 149
I have a function that needs to be evaluated at least 1e6 times. It only requires two arrays as input.
import numpy as np
np.random.seed(1)
res = np.random.rand(0,1,1000)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))
Currently,

%timeit f1(res, np.sqrt(res))

clocks in at 26 usec. Is there any way I can get faster evaluation calls/execution time?
Any means will do; e.g. not using a lambda function, using native math instead of numpy, or any trick that rewrites the equation in terms of faster math operations available in Python.
EDIT:
I made a silly mistake of producing an empty res array. It should be

res = np.random.rand(1000)

Also, the second array is actually independent from res and typically consists of values much smaller than res; I used np.sqrt(res) just for illustration.
So, rewriting the code more clearly:

import numpy as np
np.random.seed(1)
res1 = np.random.rand(1000)
res2 = np.sqrt(res1)  # not sqrt in general; typically res2 < res1
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))

And now

%timeit f1(res1, res2)

clocks in at 60 usec on my machine.
Upvotes: 2
Views: 519
Reputation: 53029
I'm pretty sure the most expensive things are all those log calls, and those you can save on: call log once instead of sig.size (in this example, 1000) times. The underlying identity is log(a) + log(b) = log(a*b). This gives me a speedup by a factor of three.
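To see the identity in isolation (a minimal standalone sketch with made-up inputs, kept away from zero so the product stays well-behaved):

import numpy as np
sig = np.random.rand(1000) + 0.5   # made-up data, bounded away from 0
a = np.sum(np.log(sig))            # sig.size separate log evaluations
b = np.log(np.prod(sig))           # one product, then a single log
print(np.isclose(a, b))            # True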
Btw. I changed random.rand(0, 1, 1000), which produces an empty array, to random.rand(1000). Also, I take it sig = sqrt(res) is not a general assumption, so I didn't try to exploit that.
I do a few other optimizations, but their effect is nowhere near as massive as getting rid of the logs: 1) precompute log(2*pi); 2) use np.dot for the last term.

As @JPdL points out, this method is vulnerable to over- and underflow. One can trade various amounts of performance to guard against that using chunking; see f4, f5 and f6 below.
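To see the failure mode the chunking guards against (a standalone sketch, not part of the benchmark below):

import numpy as np
small = np.full(1000, 1e-10)   # the product 1e-10**1000 underflows float64 to 0.0
print(np.log(np.prod(small)))  # -inf: the naive log(prod) identity breaks down
print(np.sum(np.log(small)))   # -23025.85..., still finite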
import numpy as np
np.random.seed(1)
res = np.random.rand(1000)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))
def f2(r, sig):
    return -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))

# gung ho: replace sig.size log calls with one prod and a single log
def f3(r, sig, precomp=np.log(2*np.pi)):
    rs = r/sig
    return -np.log(np.prod(sig)) - 0.5*(r.size * precomp + np.dot(rs, rs))

# chunk and hope for the best: take partial products of chnk factors each
def f4(r, sig, chnk=32, precomp=np.log(2*np.pi)):
    rs = r/sig
    sumlogsig = np.log(np.multiply.reduceat(sig, np.arange(0, len(r), chnk))).sum()
    return -sumlogsig - 0.5*(r.size * precomp + np.dot(rs, rs))

# chunk and check for extreme values: handle those separately via log
def f5(r, sig, chnk=32,
       precomp=np.log(2*np.pi), precomp2=np.exp([-8, 8])):
    rs = r/sig
    bad = np.where((sig<precomp2[0]) | (sig>precomp2[1]))[0]
    sumlogsig = np.log(sig[bad]).sum()
    keep = sig[bad]
    sig[bad] = 1                  # neutral element, so bad values drop out of the products
    sumlogsig += np.log(np.multiply.reduceat(sig, np.arange(0, len(r), chnk))).sum()
    sig[bad] = keep               # restore the caller's array
    return -sumlogsig - 0.5*(r.size * precomp + np.dot(rs, rs))

# chunk and try to be extra clever: rescale every value into a safe band first
def f6(r, sig, chnk=512,
       precomp=np.log(2*np.pi), precomp2=np.exp(np.arange(-18, 19))):
    binidx = np.digitize(sig, precomp2[1::2])<<1
    rs = r/sig
    sig = sig*precomp2[36 - binidx]   # makes a copy, so no restore needed
    bad = np.where((binidx==0) | (binidx==36))[0]
    sumlogsig = binidx.sum() - 18*r.size + np.log(sig[bad]).sum()
    sig[bad] = 1
    sumlogsig += np.log(np.multiply.reduceat(sig, np.arange(0, len(r), chnk))).sum()
    return -sumlogsig - 0.5*(r.size * precomp + np.dot(rs, rs))
from timeit import timeit
sr = np.sqrt(res)
print(timeit('f1(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f1':f1}))
print(timeit('f2(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f2':f2}))
print(timeit('f3(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f3':f3}))
print(timeit('f4(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f4':f4}))
print(timeit('f5(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f5':f5}))
print(timeit('f6(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f6':f6}))
print(f1(res,np.sqrt(res)))
print(f2(res,np.sqrt(res)))
print(f3(res,np.sqrt(res)))
print(f4(res,np.sqrt(res)))
print(f5(res,np.sqrt(res)))
print(f6(res,np.sqrt(res)))
Sample output:
0.004246247990522534
0.00418912700843066
0.0011273059935774654
0.0013386670034378767
0.0022679700050503016
0.004274581006029621
-662.250886322
-662.250886322
-662.250886322
-662.250886322
-662.250886322
-662.250886322
Upvotes: 1
Reputation: 40693
Your lambda creates a temporary array for each intermediate operation (implicit copies). You can gain a performance boost by using in-place variants of the operations where possible. See http://ipython-books.github.io/featured-01/
def f2(r, sig):
    x = np.log(sig)
    x *= 2                # in-place operation
    x += np.log(2*np.pi)  # in-place operation
    y = r / sig
    y **= 2               # in-place operation
    x += y                # in-place operation
    return -0.5 * x.sum()
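A quick sanity check that the in-place version matches the original (a standalone sketch reusing the question's setup):

import numpy as np
np.random.seed(1)
res = np.random.rand(1000)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))
print(np.isclose(f1(res, np.sqrt(res)), f2(res, np.sqrt(res))))  # True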
Also, check the output of
import numpy.distutils.system_info as sysinfo
sysinfo.show_all()
...
openblas_lapack_info:
  libraries openblas not found in ['C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Python35-32\\lib', 'C:\\', 'C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Python35-32\\libs']
  NOT AVAILABLE
If blas, lapack and atlas are NOT AVAILABLE, then you are not getting the most out of numpy. Not by a long shot. These libraries enable numpy to use SIMD operations (Single Instruction, Multiple Data: a form of parallelisation).
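As a quicker alternative check (np.show_config() is part of numpy's public API, though not mentioned above), this prints the BLAS/LAPACK configuration your numpy build was linked against:

import numpy as np
np.show_config()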
Upvotes: 1
Reputation: 3194
First, np.log(2*np.pi) is constant and won't depend on the lambda's arguments, so you can calculate it outside the lambda and reuse it. (Note that np.sum() itself cannot be dropped: the operands are arrays, and the sum is what reduces them to a scalar.) Also, this probably won't change much, but check whether a**2 is faster or slower than a*a; if the latter is faster, calculate d = r/sig once and use d*d instead of (r/sig)**2. You can also check whether 0.5*x is faster or slower than x/2.0. This isn't much, but it may give you a speedup of a microsecond or two.
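A throwaway way to measure such micro-differences on your own machine (a sketch; exact numbers will vary):

from timeit import timeit
import numpy as np
a = np.random.rand(1000)
print(timeit('a**2',  globals={'a': a}, number=100000))
print(timeit('a*a',   globals={'a': a}, number=100000))
print(timeit('0.5*a', globals={'a': a}, number=100000))
print(timeit('a/2.0', globals={'a': a}, number=100000))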
And now, some math. After the suggestions above you get:

log_2pi = np.log(2*np.pi)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + log_2pi + (r/sig)**2))
f1(res, np.sqrt(res))
So basically, you calculate:

c1*(2*log(sqrt(r)) + c2 + (r/sqrt(r))**2)

First of all:

log(sqrt(r)) = log(r**(1/2)) = 1/2*log(r)

Also (for r >= 0, which holds here):

(r/sqrt(r))**2 = sqrt(r)**2 = abs(r) = r

So basically, you're trying to calculate:

c1*(2*log(sqrt(r)) + c2 + (r/sqrt(r))**2) = c1*(2*(1/2)*log(r) + c2 + abs(r)) = c1*(log(r) + c2 + abs(r))

So you can skip the expensive square of a division (the last term of the sum) and the computation of the square root of r (done before calling f1).
Of course, all of that was done assuming that you'll always use sqrt(r) as sig.
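A quick standalone check of the derivation (using the question's setup, where r >= 0):

import numpy as np
np.random.seed(1)
r = np.random.rand(1000)
c2 = np.log(2*np.pi)
original   = -0.5*np.sum(2*np.log(np.sqrt(r)) + c2 + (r/np.sqrt(r))**2)
simplified = -0.5*np.sum(np.log(r) + c2 + np.abs(r))
print(np.isclose(original, simplified))  # True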
PS. Respect to grovina, who answered with the same train of thought 15 seconds before me, and provided measurements too. If you were going to accept my answer, please accept the other one instead, as it is more complete.
Upvotes: 1
Reputation: 3077
For reference, f1 the way you've defined it gives me:
10000 loops, best of 3: 20.9 µs per loop
A simple optimisation is to avoid computing np.log(2*np.pi) every time:
log2pi = np.log(2*np.pi)
f2 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + log2pi + (r/sig)**2))
which reduces my timeit output by approximately 15%:
100000 loops, best of 3: 17.8 µs per loop
Also, it is usually faster to compute x*x than x**2. With this approach, you can avoid the 2* operation too.
def f3(r, sig):
    sig2 = sig * sig
    return -0.5*(np.sum(np.log(sig2) + log2pi + (r*r/sig2)))
100000 loops, best of 3: 15 µs per loop
Furthermore, depending on how often you are calling this function with the arguments res, np.sqrt(res), you could definitely consider simplifying it to:

f4 = lambda a: -0.5*np.sum(np.log(a) + log2pi + a)

You can check that f4(res) == f1(res, np.sqrt(res)) and that it runs significantly faster too (around 44%):
100000 loops, best of 3: 11.7 µs per loop
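A minimal check of the claimed equivalence and speedup (a standalone sketch; exact numbers will differ):

import numpy as np
from timeit import timeit
np.random.seed(1)
res = np.random.rand(1000)
sr = np.sqrt(res)
log2pi = np.log(2*np.pi)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + log2pi + (r/sig)**2))
f4 = lambda a: -0.5*np.sum(np.log(a) + log2pi + a)
print(np.isclose(f4(res), f1(res, sr)))                        # True
print(timeit('f1(res, sr)', number=10000, globals=globals()))
print(timeit('f4(res)',     number=10000, globals=globals()))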
Upvotes: 2