Reputation: 81
Doing a few tests, I noticed that a custom loop function I wrote can be faster than NumPy's built-in np.cumprod(). I didn't think that was possible; could someone explain what is happening?
import time
import numpy as np

def custom_cumprod(x):
    xb = [1]
    for xi in x:
        xb.append(xb[-1] * xi)
    return np.array(xb[1:])

x = np.random.random((10000, 10000))

t0 = time.time()
b = custom_cumprod(x)
t1 = time.time()
print(t1 - t0)

t0 = time.time()
c = np.cumprod(x, 0)
t1 = time.time()
print(t1 - t0)

print(np.sum(c - b))
Thanks!
Upvotes: 0
Views: 273
Reputation: 13419
The issue is that the computation runs along the outer dimension, and NumPy does not seem to optimize this case. Looping along that dimension accesses values that are not consecutive in memory, resulting in cache and TLB misses.
This can be fixed by storing the array in Fortran order (np.asfortranarray), or by transposing it and running the computation along the inner dimension.
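A minimal sketch of the Fortran-order workaround (array size reduced here so it runs quickly; the timings are machine-dependent):

```python
import time
import numpy as np

x = np.random.random((4000, 4000))

# C order: cumprod along axis 0 strides across rows, so each step
# of the scan touches memory 4000 * 8 bytes apart.
t0 = time.time()
a = np.cumprod(x, axis=0)
print("C order:      ", time.time() - t0)

# Fortran order: each column is now contiguous in memory, so the
# same scan walks memory sequentially.
xf = np.asfortranarray(x)
t0 = time.time()
b = np.cumprod(xf, axis=0)
print("Fortran order:", time.time() - t0)

# The operations performed are identical, only the memory layout differs.
print(np.allclose(a, b))
```

The results are identical either way; only the traversal order through memory changes, which is where the speedup comes from.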
perf shows the difference clearly:
perf stat -e LLC-load-misses,dTLB-load-misses python3 -c \
'import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=0)'
 Performance counter stats for 'python3 -c import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=0)':

        22.254.639      LLC-load-misses:u
       114.772.180      dTLB-load-misses:u

       3,200727804 seconds time elapsed

       2,148422000 seconds user
       1,044664000 seconds sys
perf stat -e LLC-load-misses,dTLB-load-misses python3 -c \
'import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=1)'
 Performance counter stats for 'python3 -c import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=1)':

           633.959      LLC-load-misses:u
         2.485.725      dTLB-load-misses:u

       2,276527365 seconds time elapsed

       0,947946000 seconds user
       1,316536000 seconds sys
Ideally, NumPy should run the computation the same way your custom version does: traversing the rows in memory order rather than scanning down each column.
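Such a traversal can be sketched as the loop below, which computes the same axis-0 cumulative product but performs whole-row multiplies so every step touches contiguous memory (a hand-rolled illustration, not how NumPy implements cumprod internally):

```python
import numpy as np

def rowwise_cumprod(x):
    # Cumulative product along axis 0, visiting rows in memory order.
    # Each np.multiply call operates on a contiguous row, so the scan
    # is cache-friendly even though the array is in C order.
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, x.shape[0]):
        np.multiply(out[i - 1], x[i], out=out[i])
    return out

x = np.random.random((1000, 1000))
# Element-for-element, the multiplications happen in the same order
# as np.cumprod(x, axis=0), so the results match exactly.
print(np.allclose(rowwise_cumprod(x), np.cumprod(x, axis=0)))
```

This is essentially what your custom_cumprod does, minus the Python-list overhead of building and converting xb.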
Upvotes: 2