Dylan Solms

Reputation: 388

Optimize this existing 3D Numpy Matrix Multiplication

I have some code that has just been completed. It works as intended. I have opted to use dot in Numpy since, in my limited experience, it is faster than the usual forms of matrix multiplication if you have BLAS installed on your system. However, you will note that I have had to transpose a lot of stuff. I'm not sure whether the cost of all that transposing outweighs the benefit of using dot.

I provide the mathematical equation as it was found in the paper. Note that it is a recursion:

[Image: the recursion equation from the paper, which the Numpy code below implements.]

Here is my code implementation:

I provide the necessary components and their dimensions first:

from numpy import array

P = array([[0.73105858, 0.26894142],
           [0.26894142, 0.73105858]])  # shape (K,K)

B = array([[6.07061629e-09, 0.00000000e+00],
           [0.00000000e+00, 2.57640371e-10]])  # shape (K,K)

dP = array([[[ 0.19661193, -0.19661193],
             [ 0.        ,  0.        ]],

           [[ 0.        ,  0.        ],
            [ 0.19661193, -0.19661193]],

           [[ 0.        ,  0.        ],
            [ 0.        ,  0.        ]],

           [[ 0.        ,  0.        ],
            [ 0.        ,  0.        ]]])  # shape (L,K,K)

dB = array([[[ 0.00000000e+00,  0.00000000e+00],
             [ 0.00000000e+00,  0.00000000e+00]],

            [[ 0.00000000e+00,  0.00000000e+00],
             [ 0.00000000e+00,  0.00000000e+00]],

            [[-1.16721049e-09,  0.00000000e+00],
             [ 0.00000000e+00,  0.00000000e+00]],

            [[ 0.00000000e+00,  0.00000000e+00],
             [ 0.00000000e+00, -1.27824683e-09]]])  # shape (L,K,K)

a = array([[[ 7.60485178e-08,  5.73923956e-07]],

           [[-5.54100398e-09, -8.75213012e-08]],

           [[-1.25878643e-08, -1.48361081e-07]],

           [[-2.73494035e-08, -1.74585971e-07]]])  # shape (L,1,K)

alpha = array([[0.11594542, 0.88405458]])  # shape (1,K)

c = 1  # scalar

Here is the actual Numpy calculation. Note all of the transposes:

term1 = (a/c).dot(P).dot(B)
term2 = dP.transpose((0,2,1)).dot(alpha.T).transpose((0,2,1)).dot(B)
term3 = dB.dot(  P.T.dot(alpha.T) ).transpose((0,2,1))
a = term1 + term2 + term3

One should then get:

>>> a
array([[[ 1.38388584e-10, -5.87312190e-12]],

       [[ 1.05516813e-09, -4.47819530e-11]],

       [[-3.76451117e-10, -2.88160549e-17]],

       [[-4.06412069e-16, -8.65984406e-10]]])

Note that the shapes of alpha and a have been chosen by me. They can be changed if you find that doing so gives superior performance.
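For reference, the same update can also be written without any explicit transposes by using np.einsum. This is only a sketch of an equivalent formulation for the shapes given above; for matrices this small it is not necessarily faster than the dot version, so it would have to be timed:

import numpy as np

# a: (L,1,K), alpha: (1,K), P, B: (K,K), dP, dB: (L,K,K)
term1 = np.einsum('lbj,jk,ki->lbi', a, P, B) / c
term2 = np.einsum('bj,ljk,ki->lbi', alpha, dP, B)
term3 = np.einsum('bm,mj,lij->lbi', alpha, P, dB)
a = term1 + term2 + term3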

I would like to point out that I think the existing code is fast. Actually, very fast. However, I keep wondering if one could do better. Please do give it a go. I have profiled my code (which has a lot of Numpy broadcasting and vectorization) and this is not necessarily a bottleneck, as it takes 23 microseconds to evaluate on my very old machine. However, it is a single step of a recursion. This means that it is evaluated N times in a sequential manner. Hence, even the tiniest of gains would make a large difference for a large sequence.

Thank you for your time.

EDIT/UPDATE:

Thanks to @max9111, who suggested that I look at this question here, I have managed to write some Numba code that runs faster than the Numpy calculation of a. It takes 14 microseconds as opposed to the original 23.

Here it is:

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True, boundscheck=False)
def get_next_a(a, alpha, P, dP, B, dB, c):
    N, M, _ = dP.shape
    new_a = np.zeros((N, 1, M))
    for idx in nb.prange(N):
        for i in range(M):
            entry = 0.0
            for j in range(M):
                term1 = a[idx, 0, j]*P[j, i]*B[i, i]/c
                term2 = alpha[0, j]*dP[idx, j, i]*B[i, i]
                term3 = alpha[0, j]*P[j, i]*dB[idx, i, i]
                entry += term1 + term2 + term3
            new_a[idx, 0, i] = entry
    return new_a
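As a quick sanity check (just a sketch, assuming a, alpha, P, dP, B, dB and c still hold the original input values from above), the two versions can be compared directly:

a_numpy = (a/c).dot(P).dot(B) \
          + dP.transpose((0,2,1)).dot(alpha.T).transpose((0,2,1)).dot(B) \
          + dB.dot(P.T.dot(alpha.T)).transpose((0,2,1))
a_numba = get_next_a(a, alpha, P, dP, B, dB, c)
print(np.allclose(a_numpy, a_numba))  # True here, because B and dB are diagonal

Note that the loop version only reads the diagonals of B and dB, so the two agree exactly because the example matrices are diagonal.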

One will see that get_next_a returns the desired result. However, when I call it inside a pure Python function that contains Numpy, it complains. Here is a snippet of my actual code:

def forward_recursions(X,working_params):

#    P,dP,B,dB,pi,dpi = setup(X,working_params) 
    # Dummy Data and Parameters instead of setup
    working_params = np.random.uniform(0,2,size=100)
    X = np.random.uniform(0,1,size=1000)
    P = np.random.uniform(0,1,size=(10,10))
    norm = P.sum(axis=1)
    P = P/norm[:,None]
    dP = np.random.uniform(-1,1,size=(100,10,10))
    # We pretend that all 1000 of the 10 x 10 matrices 
    # only have diagonal entries
    B = np.random.uniform(0,1,size=(1000,10,10)) 
    dB = np.random.uniform(0,1,size=(1000,100,10,10))
    pi = np.random.uniform(0,1,size=10)
    norm = pi.sum()
    pi = (pi/norm).reshape(1,10)
    dpi = np.random.uniform(0,1,size=(100,1,10))

    T = len(X)
    N = len(working_params)
    M = int(np.sqrt(N))
    ones = np.ones((M,1))


    alpha = pi.dot(B[0])
    scale = alpha.dot(ones)
    alpha = alpha/scale
    ll = np.log(scale)
    a = dpi.dot(B[0]) + dB[0].dot(pi.T).transpose((0,2,1))
    for t in range(1,T):

        old_scale = scale
        alpha = alpha.dot(P).dot(B[t])
        scale = alpha.dot(ones)
        ll += np.log(scale)
        alpha = alpha/scale

        # HERE IS THE NUMBA FUNCTION

        a = get_next_a(a,alpha,P,dP,B[t],dB[t],old_scale)

    dll = a.dot(ones).reshape((N,1))/scale
    return ll,dll,a

I know that my code depends on other functions that are not included, which means that forward_recursions will not run as posted. I just hope that it gives some perspective.

The error I get is:

TypingError: Invalid use of Function(<built-in function iadd>) with argument(s) of type(s): (Literal[int](0), array(float64, 2d, C))
Known signatures:
 * (int64, int64) -> int64
 * (int64, uint64) -> int64
 * (uint64, int64) -> int64
 * (uint64, uint64) -> uint64
 * (float32, float32) -> float32
 * (float64, float64) -> float64
 * (complex64, complex64) -> complex64
 * (complex128, complex128) -> complex128
 * parameterized
In definition 0:
    All templates rejected with literals.
In definition 1:
    All templates rejected without literals.
In definition 2:
    All templates rejected with literals.
In definition 3:
    All templates rejected without literals.
In definition 4:
    All templates rejected with literals.
In definition 5:
    All templates rejected without literals.
In definition 6:
    All templates rejected with literals.
In definition 7:
    All templates rejected without literals.
In definition 8:
    All templates rejected with literals.
In definition 9:
    All templates rejected without literals.
In definition 10:
    All templates rejected with literals.
In definition 11:
    All templates rejected without literals.
In definition 12:
    All templates rejected with literals.
In definition 13:
    All templates rejected without literals.
In definition 14:
    All templates rejected with literals.
In definition 15:
    All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at <ipython-input-251-50e636317ef8> (13)

I do not understand what this means. Do you perhaps know how I can fix something like this? Thank you very much for your time.

Upvotes: 2

Views: 662

Answers (2)

Dylan Solms

Reputation: 388

I have decided to post an answer to my own question. I would like to thank @max9111 and @user3666197 for their kind help and advice. I have used what they have taught me in order to produce optimized code using Numba.

I had some struggles with Numba though. One will note that my optimized Numba version had trouble running in a Python function (as was pointed out in my edit/update section). I now understand why.

Numba is really awesome as it is really fast. Plus, the calculations are very explicit compared to Numpy. Numpy produces shorter code and, for me, is more mathematically concise to actually read. Furthermore, Numpy works seamlessly with native Python and as a result is very trouble-free to code with. Personally, I think Numpy is the fastest way to produce a working script that is guaranteed to run fast. Numba does not play that well with native Python. In fact, it and Numpy don't get along perfectly either: Numpy's beloved dot, for example, is only partially supported in Numba and could not be used here.

Numba requires one to change one's coding style a bit. I would like to think that one uses the functional paradigm of coding with Numba: a function has inputs and an output, where the output may flow into another function via its inputs. In other words, if you want to deposit data into another function, you can't nest the data-producing function inside the one that is going to process it. You have to produce the data and then feed it into the processing function via its arguments.

Nesting does work, however, but do realize that the function you are nesting must also be a Numba-compiled function. My code did not work because setup was a nested plain Python function. Numba could not recognise it, hence the TypingError: Invalid use of Function(<built-in function iadd>). So, if you want to nest functions, then you have to make sure that all of your functions are Numba functions. This is actually quite constraining and takes away from the beauty of scripting things quickly in Python. That is why I suggested not nesting functions in Numba: it gives you the freedom of not having to make all of your functions Numba functions.
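As a minimal sketch of the two patterns (the function names here are purely illustrative, not from my actual code):

import numpy as np
import numba as nb

def make_data_plain(n):      # plain Python helper
    return np.ones(n)

@nb.njit
def make_data_jit(n):        # Numba-compiled helper
    return np.ones(n)

@nb.njit
def consume_nested(n):
    x = make_data_jit(n)     # fine: the nested function is itself jitted
    # x = make_data_plain(n) # would fail to compile: Numba cannot type a plain Python function
    return x.sum()

@nb.njit
def consume_argument(x):     # fine: the data is produced outside (by any Python/Numpy code)
    return x.sum()           # and passed in as an argument

The second pattern is the one I ended up using: setup stays a plain Numpy function and its outputs are passed into the jitted recursion as arguments.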

Why would one not want all of one's functions to be in Numba if it is so fast then? Simple: Numpy broadcasting and vectorization is scary fast, faster than Numba for certain applications such as computing a probability density function for many data points and parameters. Plus, you still want to use things like numpy.linalg.solve etc.

Here is my optimized version:

import numpy as np
import numba as nb

@nb.njit(fastmath=True,parallel=True,boundscheck=False)
def forward_recursions_numba(X,
                             P,
                             dP,
                             B,
                             dB,
                             pi,
                             dpi):

#    P,dP,B,dB,pi,dpi = setup(X,working_params) # does not work with numba
#    print(P)

    T = len(X)
    N,M,_ = dP.shape
#    ones = np.ones((M,1))


#    alpha = pi.dot(B[0])
#    scale = alpha.dot(ones)
    alpha = np.zeros((1,M))
    scale = 0
    for i in range(M):
        alpha[0,i] = pi[0,i]*B[0,i,i]
        scale += alpha[0,i]

    alpha = alpha/scale
    ll = np.log(scale)


#    a = dpi.dot(B[0]) + dB[0].dot(pi.T).transpose((0,2,1))
    a = np.zeros((N,1,M))
    for idx in range(N):
        for i in range(M):
            a[idx,0,i] = dpi[idx,0,i]*B[0,i,i] + pi[0,i]*dB[0,idx,i,i]

    for t in range(1,T):
        old_scale = scale
#        alpha = alpha.dot(P).dot(B[t])
#        scale = alpha.dot(ones)
        scale = 0
        alpha_new = np.zeros(alpha.shape)
        for i  in range(M):
            entry = 0
            for j in range(M):
                entry += alpha[0,j]*P[j,i]*B[t,i,i]
            alpha_new[0,i] = entry
            scale += alpha_new[0,i]

        ll += np.log(scale)
        alpha = alpha_new/scale

#        term1 = (a/old_scale).dot(P).dot(B[t])
#        term2 = dP.transpose((0,2,1)).dot(alpha.T).transpose((0,2,1)).dot(B[t])
#        term3 = dB[t].dot(  P.T.dot(alpha.T) ).transpose((0,2,1))
#        a = term1 + term2 + term3
        new_a = np.zeros((N,1,M))
        for idx in nb.prange(N):
            for i in range(M):
                entry = 0
                for j in range(M):
                    term1 =  a[idx,0,j]*P[j,i]*B[t,i,i]/old_scale
                    term2 = alpha[0,j]*dP[idx,j,i]*B[t,i,i] 
                    term3 = alpha[0,j]*P[j,i]*dB[t,idx,i,i]
                    entry += term1 + term2 + term3
                new_a[idx,0,i] = entry
        a = new_a

#    dll = a.dot(ones).reshape((N,1))/scale
    dll = np.zeros((N,1))
    for idx in nb.prange(N):
        dparam = 0
        for i in range(M):
            dparam += a[idx,0,i]
        dll[idx] = dparam/scale

    return ll,dll,a

If we run it on some dummy data:

X = np.random.uniform(0,1,size=1000)
P = np.random.uniform(0,1,size=(10,10))
norm = P.sum(axis=1)
P = P/norm[:,None]
dP = np.random.uniform(-1,1,size=(100,10,10))
# We pretend that all 1000 of the 10 x 10 matrices 
# only have diagonal entries
B = np.random.uniform(0,1,size=(1000,10,10)) 
dB = np.random.uniform(0,1,size=(1000,100,10,10))
pi = np.random.uniform(0,1,size=10)
norm = pi.sum()
pi = (pi/norm).reshape(1,10)
dpi = np.random.uniform(0,1,size=(100,1,10))

Let us time it:

>>> %timeit forward_recursions_numba(X,P,dP,B,dB,pi,dpi)
51.3 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let us compare it to the Numpy dot version.

>>> %timeit forward_recursions(X,P,dP,B,dB,pi,dpi)
271 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

NOTE: to get the Numpy version, you can just uncomment the Numpy code, comment out the Numba loops and comment out the @nb.njit decorator.

Using lscpu on my Lubuntu machine, we can see that my specs are an Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz with 6 GB of DDR2 RAM (800 MHz).

So my code has been dramatically optimized. Plus, I have learned a lot. Thank you very much everyone for your kind help and patience.

Upvotes: 0

user3666197

Reputation: 1

Q : …if one could do better?

Your as-is code executes on my (seemingly even older) machine not in the posted ~ 23 [us], but in ~ 45 [ms] for the first call; afterwards, enjoying all the pre-fetches into the iCACHE and dCACHE hierarchies, it runs somewhere between ~ 77..1xx [us]:

>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> import numpy as np
>>>
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
44679
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
149
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
113
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
128
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
82
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
100
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
77
>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
97
>>> a
array([[[ 1.38387304e-10, -5.87323502e-12]],
       [[ 1.05516829e-09, -4.47819355e-11]],
       [[-3.76450816e-10, -2.60843400e-20]],
       [[-1.41384088e-18, -8.65984377e-10]]])

Interestingly, re-running the code many times and re-assigning the processing result back into a does not actually change the values in a:

>>> aClk.start(); a = ( a / c ).dot( P ).dot( B ) + dP.transpose( ( 0, 2, 1) ).dot( alpha.T ).transpose( ( 0, 2, 1 ) ).dot( B ) + dB.dot(  P.T.dot( alpha.T ) ).transpose( ( 0, 2, 1 ) ); aClk.stop()
97
>>> a
array([[[ 1.38387304e-10, -5.87323502e-12]],    
       [[ 1.05516829e-09, -4.47819355e-11]],
       [[-3.76450816e-10, -2.60843400e-20]],
       [[-1.41384088e-18, -8.65984377e-10]]])

This means that the code, as-is, does a lot of work only to finally deliver an invariant value of a ( a re-produced identity, at the cost of spending ~ XY [us] to do so - you being the only one to decide whether that is ok for your target application or not ).


Remarks about a wished-for space for an improvement :

Well, given N ~ 1E(3..6), K ~ 10 and L ~ 100, there is not much to expect from any improvement efforts spent on re-solving the ( so far identity-producing ) computation in the hope of improved performance.

The sought-for improved target processing will be repeated sequentially somewhere between ~ 1,000x and ~ 1,000,000x, which means :

  • RAM-bound issues are not cardinal, as the cache effects on the static parts, all having sizes of but a few [MB], will let them be reused from cache at the shortest latencies possible
  • CPU-bound issues are already pre-solved inside the design & engineering of the numpy-tools ( harnessing SIMD-resources of CPU and vector-aligned striding tricks, where feasible - so not much to expect from user-level coding )

Last, but not least, one may comment on the "costs" of transposing - numpy does nothing else to transpose a matrix but change the order of indexing - nothing else. If this were to have some effect, it could rather be expected from reviewing the FORTRAN-type versus C-language-type ordering of the underlying storage of the data cells in physical RAM, yet at scales of at most 1E1 x 1E1 ~ 1E2 x 1E1 x 1E1 this aspect becomes negligible and is well masked by the in-cache nature of the processing, with zero write-backs or other performance-related impacts.
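A tiny sketch of that fact ( illustrative only ):

import numpy as np

M  = np.arange( 6.0 ).reshape( 2, 3 )
Mt = M.T                              # no data moved - just another view with swapped strides
print( M.strides, Mt.strides )        # e.g. (24, 8) vs (8, 24)
print( np.shares_memory( M, Mt ) )    # True - the same cells in RAM, indexed differently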


RESULT :

Given all the facts and further observations above, the cheapest and the most reasonable step for indeed getting a higher throughput of the here-defined computations is to move onto a linearly-working degree of freedom here - the higher the [GHz] of the CPU-chip, the better ( linear growth of performance here ), also having as many AVX-512 registers as possible and L1i + L1d caches as large as possible ( strategies of affinity-mapped avoidance of any other O/S noise are obvious for HPC-grade performance targets ), and there rely on the already smart numpy tools, fine-tuned for this mix of CPU-resources for matrix processing ( if in a need to go beyond the float64 IEEE-754 representation, another story starts ).

Do not expect user-level code to do much better than what the numpy-native, SIMD-aligned processing can and will deliver.

Inline assembly could, for the above-given scales, get an edge, yet at an immense human-labour cost of having to craft and test such an ultimate, yet rather arcane, change in the concept of the solution. Kindly let me know if the Market indeed demands doing such a step.

Upvotes: 1
