саша

Reputation: 531

Speed up Cython code

I have code that is working in Python and want to use Cython to speed up the calculation. The function I've copied below is in a .pyx file and gets called from my Python code. V, C, train and I_k are 2-d numpy arrays, and lambda_u, user and hidden are ints. I don't have any experience with C or Cython. What is an efficient way to make this code faster? Compiling with cython -a shows me that most of the loop body still interacts with Python, but I don't know how to improve it. Using for i in prange(user_size, nogil=True): results in the error "Constructing Python slice object not allowed without gil".

How does the code have to be modified to harness the power of Cython?

    import numpy as np
    cimport numpy as np
    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def u_update(V, C, train, I_k, lambda_u, user, hidden):
        cdef int user_size = user
        cdef int hidden_dim = hidden
        cdef np.ndarray U = np.empty((hidden_dim, user_size), float)
        cdef int m = C.shape[1]

        for i in range(user_size):
            C_i = np.zeros((m, m), dtype=float)
            for j in range(m):
                C_i[j, j] = C[i, j]

            U[:, i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i, V.T)) + lambda_u*I_k),
                             np.dot(V, np.dot(C_i, train[i, :].T)))

        return U

Upvotes: 3

Views: 1096

Answers (2)

hpaulj

Reputation: 231665

You are trying to use cython by diving into the deep end of the pool. You should start with something small, such as some of the numpy examples. Or even try to improve on np.diag.

    i = 0
    C_i = np.zeros((m, m), dtype=float)
    for j in range(m):
        C_i[j,j]=C[i,j]

versus:

    C_i = np.diag(C[i,:])

Can you improve on the speed of this simple expression? np.diag is not compiled, but internally it performs an efficient indexed assignment:

 res[:n-k].flat[i::n+1] = v
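Applied directly to your C_i (a quick sketch, assuming v = C[i,:] and a square (m, m) target), that same trick is:

    C_i = np.zeros((m, m))
    C_i.flat[::m + 1] = v    # every (m+1)-th flat element is a main-diagonal entry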

But the real problem for cython is this expression:

    U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k),
                    np.dot(V, np.dot(C_i,train[i,:].T)))

np.dot is compiled. cython won't turn that into C code, nor will it consolidate all 5 dots into one expression. It also won't touch the inv. So at best cython will speed up the iteration wrapper, but it will still call this Python expression user_size times.

My guess is that this expression can be cleaned up. Replacing the inner dots with einsum can probably eliminate the need for C_i. The inv might make 'vectorizing' the whole thing difficult. But I'd have to study it more.
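For example, something along these lines (untested) should reproduce the two inner products for a single i without ever forming C_i:

    # contract diag(C[i]) directly into the products; V is (hidden, m), C[i,:] is (m,)
    VCV = np.einsum('am,m,bm->ab', V, C[i,:], V)          # == V @ C_i @ V.T
    VCt = np.einsum('am,m,m->a', V, C[i,:], train[i,:])   # == V @ C_i @ train[i,:]
    U[:,i] = np.dot(np.linalg.inv(VCV + lambda_u*I_k), VCt)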

But if you want to stick with the cython route, you need to transform that U expression into simple iterative code, without calls to numpy functions like dot and inv.

===================

I believe the following are equivalent:

    np.dot(C_i,V.T)
    C[i,:,None]*V.T

In:

    np.dot(C_i,train[i,:].T)

if train is 2d, then train[i,:] is 1d, and the .T does nothing.

    In [289]: np.dot(np.diag([1,2,3]),np.arange(3))
    Out[289]: array([0, 2, 6])
    In [290]: np.array([1,2,3])*np.arange(3)
    Out[290]: array([0, 2, 6])

If I got that right, you don't need C_i.
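A quick check with arbitrary shapes (a throwaway sketch; m=4 and hidden=3 chosen at random) says the same for the 2-d case:

    import numpy as np
    V = np.random.rand(3, 4)
    c = np.random.rand(4)                  # stands in for C[i,:]
    C_i = np.diag(c)
    print(np.allclose(np.dot(C_i, V.T), c[:,None]*V.T))   # True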

======================

Furthermore, these calculations can be moved outside the loop, with expressions like (not tested)

    CV1 = C[:,:,None]*V.T   # a 3d array; CV1[i] == C_i @ V.T
    CV2 = C * train         # elementwise; CV2[i] == C_i @ train[i,:]

    for i in range(user_size):
        U[:,i] = np.dot(np.linalg.inv(np.dot(V, CV1[i,...]) + lambda_u*I_k),
                        np.dot(V, CV2[i,...]))

A further step is to move both np.dot(V,CV...) out of the loop. That may require np.matmul (@) or np.einsum. Then we will have

    for i in range(user_size):
        Ainv = np.linalg.inv(VCV1[i,...] + lambda_u*I_k)
        U[:,i] = np.dot(Ainv, VCV2[i,...])

or even

    for i in range(user_size):
        Ainv[i,...] = np.linalg.inv(VCV1[i,...] + lambda_u*I_k)  # if inv can't be vectorized
    U = np.einsum(..., Ainv, VCV2)

This is a rough sketch, and details will need to be worked out.
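For instance, one way those details might work out (untested; assumes V is (hidden, m), C and train are (user_size, m), I_k is (hidden, hidden), and uses np.linalg.solve in place of the explicit inv):

    # batch all user_size solves at once; the linalg routines broadcast over the leading axis
    VCV1 = np.einsum('am,im,bm->iab', V, C, V)       # stack of V @ C_i @ V.T
    VCV2 = np.einsum('am,im,im->ia', V, C, train)    # stack of V @ C_i @ train[i,:]
    A = VCV1 + lambda_u*I_k                          # I_k broadcasts over the i axis
    U = np.linalg.solve(A, VCV2).T                   # shape (hidden, user_size)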

Upvotes: 3

NILobachevsky

Reputation: 138

The first thing that comes to mind is that you haven't typed the function arguments, specifying their data type and number of dimensions, like so:

    def u_update(np.ndarray[np.float64_t, ndim=2] V,
                 np.ndarray[np.float64_t, ndim=2] C,
                 np.ndarray[np.float64_t, ndim=2] train,
                 np.ndarray[np.float64_t, ndim=2] I_k,
                 int lambda_u, int user, int hidden):

This will greatly speed up indexing with 2 indices like you do in the inner loop.

It's best to do this to the array U as well, although you are using slicing:

    cdef np.ndarray[np.float64_t, ndim=2] U = np.empty((hidden_dim, user_size), dtype=np.float64)

Next, you are redefining C_i, a large 2-D array, on every iteration of the outer loop. Also, you have not supplied any type information for it, which is a must if Cython is to offer any speedup. To fix this:

    cdef np.ndarray[np.float64_t, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)
    for i in range(user_size):
        C_i.fill(0)

Here, we have defined it once (with type information), and reused the memory by filling with zeros instead of calling np.zeros() to make a new array every time.

Also, you might want to turn off bounds checking only after you have finished debugging.

If you need speedups in the U[:,i]=... step, you could consider writing another function with Cython to perform those operations using loops.
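Putting these changes together, a rough (untested) sketch of the revised function might look like this; note that the numpy calls in the U[:, i] line still execute as ordinary Python:

    cimport cython
    cimport numpy as np
    import numpy as np

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def u_update(np.ndarray[np.float64_t, ndim=2] V,
                 np.ndarray[np.float64_t, ndim=2] C,
                 np.ndarray[np.float64_t, ndim=2] train,
                 np.ndarray[np.float64_t, ndim=2] I_k,
                 int lambda_u, int user, int hidden):
        cdef int user_size = user
        cdef int hidden_dim = hidden
        cdef int m = C.shape[1]
        cdef int i, j
        cdef np.ndarray[np.float64_t, ndim=2] U = np.empty((hidden_dim, user_size), dtype=np.float64)
        cdef np.ndarray[np.float64_t, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)

        for i in range(user_size):
            C_i.fill(0)                      # reuse the buffer instead of reallocating
            for j in range(m):
                C_i[j, j] = C[i, j]          # typed 2-index access compiles to plain C indexing
            U[:, i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i, V.T)) + lambda_u*I_k),
                             np.dot(V, np.dot(C_i, train[i, :])))

        return U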

Do read this tutorial, which should give you an idea of what to do (and what not to do) when working with Numpy arrays in Cython, and also of how much of a speedup you can get with these simple changes.

Upvotes: 2
