JPdL

Reputation: 149

faster evaluation calls of a function

I have a function that needs to be evaluated at least 1e6 times. It only requires two arrays as input.

import numpy as np
np.random.seed(1)

res = np.random.rand(0,1,1000)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))    

Currently,

%timeit f1(res,np.sqrt(res)) 

clocks in at 26 usec. Is there any means I can have faster evaluation calls/execution time?

Any means will do; e.g. not using a lambda function, using native math instead of numpy, or any trick that rewrites the equation in terms of faster math operations available in Python.

EDIT:

I made a silly mistake of producing an empty res array. So it should be

res=np.random.rand(1000)

Also, the second array np.sqrt(res) is actually independent of res and typically consists of values much smaller than res; I used sqrt(res) just for illustration.

So rewriting the code more clearly

import numpy as np
np.random.seed(1)

res1 = np.random.rand(1000)
res2 = np.sqrt(res1)  # not true in general, but typically res2 < res1
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2)) 

And now %timeit f1(res1,res2) clocks in at 60 usec on my machine.

Upvotes: 2

Views: 519

Answers (4)

Paul Panzer

Reputation: 53029

I'm pretty sure the most expensive part is all those logs, and those you can save (i.e. call log once instead of sig.size - in this example 1000 - times). The underlying identity is log(a) + log(b) = log(a*b). That gives me a speedup by a factor of three.
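A quick numerical check of that identity on a small array (my own illustration, not part of the original answer):

import numpy as np

sig = np.array([0.5, 1.5, 2.0, 0.8])
# the sum of the logs equals the log of the product (for positive values)
print(np.allclose(np.log(sig).sum(), np.log(np.prod(sig))))  # True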

Btw. I changed random.rand(0, 1, 1000), which produces an empty array, to random.rand(1000). Also, I take it sig=sqrt(res) is not a general assumption, so I didn't try to exploit that.

I do a few other optimizations, but their effect is nowhere near as big as getting rid of the logs: 1) precompute log(2*pi), and 2) use np.dot for the last term.

Update:

As @JPdL points out, this method is vulnerable to over- and underflow. One can trade various amounts of performance to guard against that using chunking; see f4, f5 and f6 below.

import numpy as np
np.random.seed(1)

res = np.random.rand(1000)
f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2)) 

def f2(r, sig):
    return -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))

# gung ho
def f3(r, sig, precomp=np.log(2*np.pi)):
    rs = r/sig
    return -np.log(np.prod(sig)) - 0.5*(r.size * precomp + np.dot(rs, rs))

# chunk and hope for the best
def f4(r, sig, chnk=32, precomp=np.log(2*np.pi)):
    rs = r/sig
    sumlogsig = np.log(np.multiply.reduceat(sig, np.arange(0, len(r), chnk))).sum()
    return -sumlogsig - 0.5*(r.size * precomp + np.dot(rs, rs))

# chunk and check for extreme values
def f5(r, sig, chnk=32,
       precomp=np.log(2*np.pi), precomp2=np.exp([-8, 8])):
    rs = r/sig
    bad = np.where((sig<precomp2[0]) | (sig>precomp2[1]))[0]
    sumlogsig = np.log(sig[bad]).sum()
    keep = sig[bad]
    sig[bad] = 1
    sumlogsig += np.log(np.multiply.reduceat(sig, np.arange(0, len(r), chnk))).sum()
    sig[bad] = keep
    return -sumlogsig - 0.5*(r.size * precomp + np.dot(rs, rs))

# chunk and try to be extra clever
def f6(r, sig, chnk=512,
       precomp=np.log(2*np.pi), precomp2=np.exp(np.arange(-18, 19))):
    binidx = np.digitize(sig, precomp2[1::2])<<1
    rs = r/sig
    sig = sig*precomp2[36 - binidx]
    bad = np.where((binidx==0) | (binidx==36))[0]
    sumlogsig = binidx.sum() - 18*r.size + np.log(sig[bad]).sum()
    sig[bad] = 1
    sumlogsig += np.log(np.multiply.reduceat(sig, np.arange(0, len(r), chnk))).sum()
    return -sumlogsig - 0.5*(r.size * precomp + np.dot(rs, rs))

from timeit import timeit
sr = np.sqrt(res)
print(timeit('f1(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f1':f1}))
print(timeit('f2(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f2':f2}))
print(timeit('f3(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f3':f3}))
print(timeit('f4(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f4':f4}))
print(timeit('f5(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f5':f5}))
print(timeit('f6(res,sr)', number=100, globals={'res':res, 'np':np, 'sr':sr, 'f6':f6}))
print(f1(res,np.sqrt(res)))
print(f2(res,np.sqrt(res)))
print(f3(res,np.sqrt(res)))
print(f4(res,np.sqrt(res)))
print(f5(res,np.sqrt(res)))
print(f6(res,np.sqrt(res)))

Sample output:

0.004246247990522534
0.00418912700843066
0.0011273059935774654
0.0013386670034378767
0.0022679700050503016
0.004274581006029621
-662.250886322
-662.250886322
-662.250886322
-662.250886322
-662.250886322
-662.250886322

Upvotes: 1

Dunes

Reputation: 40693

You're making implicit copies in your lambda. You can gain a performance boost by using in-place variants of the operations where possible. See http://ipython-books.github.io/featured-01/

def f2(r, sig):
    x = np.log(sig)
    x *= 2  # in-place operation

    x += np.log(2*np.pi)  # in-place operation

    y = r / sig
    y **= 2  # in-place operation

    x += y  # in-place operation

    return -0.5 * x.sum()

Also, check the output of

import numpy.distutils.system_info as sysinfo
sysinfo.show_all()

...

openblas_lapack_info:
  libraries openblas not found in ['C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Python35-32\\lib', 'C:\\', 'C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Python35-32\\libs']
  NOT AVAILABLE

If blas, lapack and atlas are NOT AVAILABLE, then you are not getting the most out of numpy, not by a long shot. These libraries enable numpy to use SIMD operations (Single Instruction Multiple Data, a form of parallelisation).
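If you just want to see which BLAS/LAPACK libraries your numpy build is linked against, np.show_config() prints the same information (a minimal sketch; the exact output depends on your installation):

import numpy as np

# prints the BLAS/LAPACK build configuration numpy was compiled with
np.show_config()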

Upvotes: 1

Filip Malczak

Reputation: 3194

Actually, you can skip np.sum(), because you call it on a single argument obtained by applying the + operator. Besides, np.log(2*np.pi) is constant and doesn't depend on the lambda's arguments, so you can calculate it outside the lambda and reuse it. Also, it probably won't change much, but check whether a**2 is faster or slower than a*a; if the latter is faster, calculate d = r/sig once and use d*d instead of (r/sig)**2. You can also check whether 0.5*x is faster or slower than x/2.0. This isn't much, but it may give you a speedup of a microsecond or two.
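A quick way to run those timing checks yourself (my own sketch; the array a and the repeat counts are arbitrary):

import numpy as np
from timeit import timeit

a = np.random.rand(1000)
g = {'a': a}
# power versus explicit multiplication
print(timeit('a**2', globals=g, number=100000))
print(timeit('a*a', globals=g, number=100000))
# multiplying by 0.5 versus dividing by 2.0
print(timeit('0.5*a', globals=g, number=100000))
print(timeit('a/2.0', globals=g, number=100000))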

And now, some math. After suggestions above you get:

log_of_pi = np.log(2*np.pi)
f1 = lambda r, sig: -0.5*(2*np.log(sig) + log_of_pi + (r/sig)**2)
f1(res,np.sqrt(res)) 

So basically, you calculate:

c1*(2*log(sqrt(r)) + c2 + (r/sqrt(r))**2)

First of all:

log(sqrt(r)) = log(r**(1/2)) = 1/2*log(r)

Also:

(r/sqrt(r))**2 = sqrt(r)**2 = abs(r)

So basically, you're trying to calculate:

c1*(2*log(sqrt(r)) + c2 + (r/sqrt(r))**2) = c1*(2*1/2*log(r) + c2 + abs(r)) = c1*(log(r) + c2 + abs(r))

So you can skip the expensive square of a division (the last term of the sum) and the computation of the square root of r (done before calling f1).

Of course that was done assuming that you'll always use sqrt(r) as sig.
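For completeness, a minimal sketch of the simplified function under that assumption (the name f_simplified is mine; I keep the np.sum from the original f1 so it still returns a scalar):

import numpy as np

log_of_2pi = np.log(2*np.pi)
# equivalent to f1(r, np.sqrt(r)) for non-negative r, with no sqrt, division or squaring
f_simplified = lambda r: -0.5*np.sum(np.log(r) + log_of_2pi + np.abs(r))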

PS. Respect to grovina, who answered with the same train of thought 15 seconds before me and provided measurements too. If you want to accept an answer, accept the other one, as it is more complete.

Upvotes: 1

grovina

Reputation: 3077

For reference, f1 the way you've defined it gives me:

10000 loops, best of 3: 20.9 µs per loop

A simple optimisation is to avoid computing np.log(2*np.pi) every time:

log2pi = np.log(2*np.pi)
f2 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + log2pi + (r/sig)**2))

which reduces my timeit output by approximately 15%:

100000 loops, best of 3: 17.8 µs per loop

Also, it is usually faster to compute x*x than x**2. With this approach, you can avoid the 2* operation too.

def f3(r, sig):
    sig2 = sig * sig
    return -0.5*(np.sum(np.log(sig2) + log2pi + (r*r/sig2)))

100000 loops, best of 3: 15 µs per loop

Furthermore, depending on how often you are calling this function with arguments res, np.sqrt(res), you could definitely consider simplifying it to:

f4 = lambda a: -0.5*np.sum(np.log(a) + log2pi + a)

You can check that f4(res) == f1(res, np.sqrt(res)) and that it goes significantly faster too (around 44%).

100000 loops, best of 3: 11.7 µs per loop
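One way to verify that equivalence numerically (my own check; np.isclose allows for floating-point rounding):

import numpy as np

np.random.seed(1)
res = np.random.rand(1000)
log2pi = np.log(2*np.pi)

f1 = lambda r, sig: -0.5*(np.sum(2*np.log(sig) + np.log(2*np.pi) + (r/sig)**2))
f4 = lambda a: -0.5*np.sum(np.log(a) + log2pi + a)

print(np.isclose(f4(res), f1(res, np.sqrt(res))))  # True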

Upvotes: 2
