Pierre Vermot

Reputation: 81

Why is np.cumprod() slower than my own function for large arrays?

While running a few tests, I noticed that a custom loop function I wrote can be faster than NumPy's built-in np.cumprod(). I didn't think that was possible. Could someone explain what is happening?

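import time
import numpy as np

def custom_cumprod(x):
    # Cumulative product along axis 0, built up one whole row at a time.
    xb = [1]
    for xi in x:
        xb.append(xb[-1]*xi)
    return np.array(xb[1:])

x = np.random.random((10000, 10000))

# Time the custom loop.
t0 = time.perf_counter()
b = custom_cumprod(x)
t1 = time.perf_counter()
print(t1-t0)

# Time NumPy's built-in along the same axis.
t0 = time.perf_counter()
c = np.cumprod(x,0)
t1 = time.perf_counter()
print(t1-t0)

# Sanity check: the summed difference between the two results.
print(np.sum(c-b))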

Thanks!

Upvotes: 0

Views: 273

Answers (1)

Homer512

Reputation: 13419

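The issue is that the computation runs along the outer dimension (axis 0), and NumPy doesn't seem to optimize this case. In the default C (row-major) order, consecutive values along this dimension are not adjacent in memory: for a 10000 x 10000 float64 array they are 80,000 bytes apart. Looping along this dimension therefore results in many cache and TLB misses.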
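To make the memory layout concrete, here is a small sketch that inspects an array's strides. It uses a 1000 x 1000 stand-in (my choice, to keep memory use low) rather than the array from the question; the same reasoning scales to 10000 columns.

```python
import numpy as np

# Small stand-in for the 10000 x 10000 array in the question.
x = np.random.random((1000, 1000))

# NumPy arrays are C-ordered (row-major) by default: the last axis is contiguous.
print(x.flags['C_CONTIGUOUS'])  # True

# strides = (bytes to the next row, bytes to the next column)
print(x.strides)                # (8000, 8)
```

Each step along axis 0 jumps a full row ahead, 8 bytes times the number of columns. With 10000 columns that is 80,000 bytes per step, which is larger than a typical 4 KiB memory page, so it is plausible that nearly every step touches a different page.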

This can be fixed by storing the array in Fortran order (np.asfortranarray), or transposing it and running the computation along the inner dimension.
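A minimal sketch of both options follows, on a small array so it runs quickly. One assumption on my part: for the transpose variant I make a contiguous copy with np.ascontiguousarray, because a plain x.T is only a view and does not change how the data sits in memory.

```python
import numpy as np

x = np.random.random((500, 400))

# Reference result: accumulate along axis 0 of the C-ordered array.
ref = np.cumprod(x, axis=0)

# Option 1: Fortran-order copy, so values along axis 0 are contiguous.
xf = np.asfortranarray(x)
res_f = np.cumprod(xf, axis=0)

# Option 2: contiguous copy of the transpose, accumulate along the
# inner dimension (axis 1), then transpose the result back.
xt = np.ascontiguousarray(x.T)
res_t = np.cumprod(xt, axis=1).T

print(np.allclose(ref, res_f), np.allclose(ref, res_t))
```

Note that both conversions copy the whole array, which costs time and memory of its own. They are most likely to pay off when the array can be created in the right layout from the start, or when it is reused for several computations.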

perf shows the difference clearly:

perf stat -e LLC-load-misses,dTLB-load-misses python3 -c \
    'import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=0)'

 Performance counter stats for 'python3 -c import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=0)':

        22.254.639      LLC-load-misses:u                                           
       114.772.180      dTLB-load-misses:u                                          

       3,200727804 seconds time elapsed

       2,148422000 seconds user
       1,044664000 seconds sys

perf stat -e LLC-load-misses,dTLB-load-misses python3 -c \
    'import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=1)'

 Performance counter stats for 'python3 -c import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=1)':

           633.959      LLC-load-misses:u                                           
         2.485.725      dTLB-load-misses:u                                          

       2,276527365 seconds time elapsed

       0,947946000 seconds user
       1,316536000 seconds sys

Ideally, NumPy would run the computation the same way your custom version does: it multiplies one whole row by the next, so every step reads and writes contiguous memory.
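For illustration, here is a variant of the row-at-a-time approach from the question that writes into a preallocated output instead of building a Python list. The name rowwise_cumprod is hypothetical, and the sketch assumes a 2-D array with at least one row. It is not how NumPy implements np.cumprod, and I have not benchmarked it here.

```python
import numpy as np

def rowwise_cumprod(x):
    """Cumulative product along axis 0, computed one row at a time."""
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, x.shape[0]):
        # Multiply the previous result row by the current input row and
        # write straight into the output row; all three rows are contiguous.
        np.multiply(out[i - 1], x[i], out=out[i])
    return out

x = np.random.random((50, 30))
print(np.allclose(rowwise_cumprod(x), np.cumprod(x, axis=0)))
```

The Python loop runs only once per row, so its overhead is small compared with the vectorized multiplication of each row when the rows are long.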

Upvotes: 2
