Reputation: 67
I am looking to speed up the following Python/NumPy code:
def fun_np(m, data):
    a, b, c = data[:, 0], data[:, 1], data[:, 2]
    M = len(data[:, 0])
    n = round((m+1)*(m+2)*(m+3)/6)
    u = np.zeros((M, n))
    C = 0
    for i in range(0, m+1):
        for j in range(0, i+1):
            for k in range(0, j+1):
                if ((i-j) != 0):
                    u[:, C] = (j-k)*(a)**(i-j)*(b)**(j-k-1)*(c)**k
            C = C+1
    return u
The corresponding Cython code is as follows:
%%cython
import numpy as np
cimport numpy as np
from cython import wraparound, boundscheck, nonecheck

@boundscheck(False)
@wraparound(False)
@nonecheck(False)
cpdef fun_cyt(int m, np.ndarray[np.float64_t, ndim=2] data):
    cdef:
        np.ndarray[np.float64_t, ndim=1] a = data[:, 0]
        np.ndarray[np.float64_t, ndim=1] b = data[:, 1]
        np.ndarray[np.float64_t, ndim=1] c = data[:, 2]
        int M, n
        Py_ssize_t i, j, k, s
    M = len(data[:, 0])
    n = round((m+1)*(m+2)*(m+3)/6)
    cdef np.ndarray[np.float64_t, ndim=2] u = np.zeros((M, n), dtype=np.float64)
    cdef int C = 0
    for i in range(m+1):
        for j in range(i+1):
            for k in range(j+1):
                for s in range(M):
                    if (i-j) != 0:
                        u[s, C] = (j-k)*(a[s])**(i-j)*(b[s])**(j-k-1)*(c[s])**k
            C = C+1
    return u
Here are the timings:
z = np.random.randn(6000, 3); m=20;
%timeit fun_np(m,z);
result: 1.97 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fun_cyt(m,z);
result: 1.91 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, there is no significant speed difference between the NumPy and Cython versions. I would appreciate it if you could help me optimize the Cython code, if possible.
Annotated HTML of the Cython code: html
Upvotes: 0
Views: 319
Reputation: 93
Very interesting example! Most of the operations here act on the 6000-element vectors. Cython cannot really be faster than NumPy when it comes to power, multiplication, and addition on large vectors. By implementing this in Cython you may match NumPy's speed, and perhaps gain 10% to 20% by removing some of NumPy's overhead.
However, there are other ways to speed this calculation up. The vector operations read the three columns of your data array and write into the columns of your output array. By default, NumPy arrays use row-major (C) ordering, i.e., the rows are contiguous in memory, so every column access is strided. For the operations done here, this is bad. Further reading: https://en.wikipedia.org/wiki/Row-_and_column-major_order
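To see the difference concretely, here is a minimal, self-contained sketch (independent of the code below) comparing the strides of a column in C order versus Fortran order:

import numpy as np

# Same logical array in row-major (C) and column-major (Fortran) order;
# only the memory layout differs.
u_c = np.zeros((6000, 100))             # default order='C': rows contiguous
u_f = np.zeros((6000, 100), order='F')  # columns contiguous

# A column of the C-ordered array is strided (800 bytes = one full row of
# float64 between consecutive elements); a column of the F-ordered array
# is one contiguous block of adjacent doubles.
print(u_c[:, 0].strides)  # (800,)
print(u_f[:, 0].strides)  # (8,)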
The two functions below are basically the same; they would be identical if the output array were created outside the function.
Note the following: I replaced u[:,C] = ... with u[:,C] += ..., because otherwise each column is overwritten on every k iteration, so its final value comes only from k = j and is thus always 0. I do not know what the point of these calculations is, but that is probably not it.
import numpy as np

def fun_np(m, data):
    a, b, c = data[:, 0], data[:, 1], data[:, 2]
    M = len(data[:, 0])
    n = round((m+1)*(m+2)*(m+3)/6)
    u = np.zeros((M, n))
    C = 0
    for i in range(0, m+1):
        for j in range(0, i+1):
            for k in range(0, j+1):
                if ((i-j) != 0):
                    u[:, C] += (j-k)*(a)**(i-j)*(b)**(j-k-1)*(c)**k
            C = C+1
    return u

def fun_npF(m, data):
    a, b, c = data[:, 0], data[:, 1], data[:, 2]
    M = len(data[:, 0])
    n = round((m+1)*(m+2)*(m+3)/6)
    u = np.zeros((M, n), order='F')  # column-major output
    C = 0
    for i in range(0, m+1):
        for j in range(0, i+1):
            for k in range(0, j+1):
                if ((i-j) != 0):
                    u[:, C] += (j-k)*(a)**(i-j)*(b)**(j-k-1)*(c)**k
            C = C+1
    return u
z = np.random.randn(6000, 3); m = 20

print("Numpy Row-major")
%timeit fun_np(m, z)

# Fortran order, because the vector operations are on columns
print("Numpy Column-major")
zF = np.asarray(z.copy(), order='F')
%timeit fun_npF(m, zF)

# Check that the outputs are the same
diff = max(np.ravel(abs(fun_np(m, z) - fun_npF(m, zF))))
max_rm = max(np.ravel(abs(fun_np(m, z))))
max_cm = max(np.ravel(abs(fun_npF(m, zF))))
print("Difference: %f, Max value Row-major: %f, Max value Column-major: %f" % (diff, max_rm, max_cm))
This gives me
Numpy Row-major
1.64 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numpy Column-major
16 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Difference: 0.000000, Max value Row-major: 196526643123.792450, Max value Column-major: 196526643123.792450
You can gain even more by thinking about where to put the "if" and by combining this with Cython, but again only by some 10% to 20%, I would guess.
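As a minimal sketch of the "if" idea (my own, untimed; fun_np_hoisted is a hypothetical name): the (i-j) != 0 test does not depend on k, so it can be hoisted out of the k loop, and when i == j the whole k loop is skipped while only the column counter advances:

import numpy as np

def fun_np_hoisted(m, data):
    # Same computation as fun_npF above, with the (i-j) != 0 test
    # hoisted out of the k loop, since it does not depend on k.
    a, b, c = data[:, 0], data[:, 1], data[:, 2]
    M = len(data[:, 0])
    n = round((m+1)*(m+2)*(m+3)/6)
    u = np.zeros((M, n), order='F')
    C = 0
    for i in range(0, m+1):
        for j in range(0, i+1):
            if i != j:  # hoisted: when i == j, the column simply stays zero
                for k in range(0, j+1):
                    u[:, C] += (j-k)*(a)**(i-j)*(b)**(j-k-1)*(c)**k
            C = C+1
    return u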
Upvotes: 1
Reputation: 7157
As already mentioned in the comments, you could try it with numba. I would further recommend parallelizing the loops:
import numpy as np
from numba import prange, jit

@jit(nopython=True, parallel=True)
def fun_numba(m, data):
    a, b, c = data[:, 0], data[:, 1], data[:, 2]
    M = len(data[:, 0])
    n = round((m+1)*(m+2)*(m+3)/6)
    u = np.zeros((M, n))
    C = 0
    for i in range(0, m+1):
        for j in range(0, i+1):
            for k in prange(0, j+1):
                if ((i-j) != 0):
                    u[:, C] = (j-k)*(a)**(i-j)*(b)**(j-k-1)*(c)**k
            C = C+1
    return u
This gives me on my machine:
In [11]: %timeit fun_np(m,z)
642 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: %timeit fun_numba(m,z)
101 ms ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
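A further variant one could try (a sketch of my own, untimed; fun_numba_rows is a hypothetical name): parallelize over the row dimension s instead of k, so each thread writes disjoint rows of u and the column counter C stays an ordinary sequential variable. This version also adopts the += accumulation from the answer above.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fun_numba_rows(m, data):
    M = data.shape[0]
    n = (m+1)*(m+2)*(m+3) // 6  # integer arithmetic, avoids round() in nopython code
    u = np.zeros((M, n))
    for s in prange(M):  # each thread owns its own rows of u: no write conflicts
        a, b, c = data[s, 0], data[s, 1], data[s, 2]
        C = 0
        for i in range(m+1):
            for j in range(i+1):
                if i != j:
                    for k in range(j+1):
                        u[s, C] += (j-k)*a**(i-j)*b**(j-k-1)*c**k
                C += 1
    return u

Because each thread then sweeps along a row of u, the default row-major layout is already the cache-friendly one here.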
Upvotes: 1