naseefo

Reputation: 720

Python Optimization: How to speed up matrix inverse operation?

My code contains a for loop with a large number of iterations. Within the loop I need to do matrix multiplication and matrix inversion (normally on a matrix of size 12 x 12). The loop needs to run 120,000 times, and it currently takes about 14 s, which is very slow compared to MATLAB (1 s) and FORTRAN (0.4 s). Below is the function I am trying to optimize:

import numpy as np

def fixed_simulator(ref, xg, yg, dt, ndiv, ijk, nst, smx, skx, cdx, smy, sky, cdy):

    # Newmark time-integration constants
    gamma = 0.5
    beta = 1.0/6.0

    knx = skx + (gamma/beta/dt)*cdx + (1.0/beta/np.power(dt,2))*smx

    dx1 = np.zeros((nst,1), dtype=float)
    vx1 = np.zeros((nst,1), dtype=float)
    px1 = np.diag(-1.0*smx).reshape(nst,1)*xg[0]
    ax1 = np.matmul(np.linalg.inv(smx), px1 - np.matmul(cdx, vx1) - np.matmul(skx, dx1))

    # I = np.ones((nst,1), dtype=float)

    dx2 = np.zeros((nst,1), dtype=float)
    vx2 = np.zeros((nst,1), dtype=float)
    px2 = np.zeros((nst,1), dtype=float)
    ax2 = np.zeros((nst,1), dtype=float)

    na1x = (1.0/beta/np.power(dt,2))*smx + (gamma/beta/dt)*cdx
    na2x = (1.0/beta/dt)*smx + (gamma/beta - 1.0)*cdx
    na3x = (1.0/2.0/beta - 1.0)*smx + (gamma*dt/2.0/beta - dt)*cdx

    print(len(xg))

# -----> Below is the loop that is taking a long time.

    for i in range(1,len(xg)):

        px2 = np.diag(-1.0*smx).reshape(nst,1)*xg[i]

        pcx1 = px2 + np.matmul(na1x, dx1) + np.matmul(na2x, vx1) + np.matmul(na3x, ax1)

        dx2 = np.matmul(np.linalg.inv(smx), pcx1)

        vx2 = (gamma/beta/dt)*(dx2 - dx1) + (1.0 - gamma/beta)*vx1 + dt*(1.0 - gamma/2.0/beta)*ax1

        ax2 = np.matmul(np.linalg.inv(smx), px2 - np.matmul(cdx, vx2) - np.matmul(skx, dx2))

        dx1, vx1, px1, ax1 = dx2, vx2, px2, ax2 

Most of the time seems to be spent in the matrix inverse and matrix multiplication calls.
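As a general NumPy pattern (an aside, not something from the post): for a linear system like `smx @ x = b`, `np.linalg.solve` is usually faster and more numerically stable than forming the explicit inverse and multiplying. A minimal sketch with a hypothetical stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
nst = 12
# Hypothetical stand-ins for smx and a right-hand side; the real matrices aren't shown
smx = rng.standard_normal((nst, nst)) + nst * np.eye(nst)  # well-conditioned
b = rng.standard_normal((nst, 1))

# Both compute the same solution, but solve avoids forming the explicit inverse
x_inv = np.matmul(np.linalg.inv(smx), b)
x_solve = np.linalg.solve(smx, b)

assert np.allclose(x_inv, x_solve)
```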

Numpy Configuration in the system:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]

cProfile Result

         2157895 function calls in 2.519 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.474    1.474    2.519    2.519 C:\Users\Naseef\OneDrive\04AllPhDPrograms\mhps\mhps\fixed.py:154(fixed_simulator)
   839163    0.556    0.000    0.556    0.000 {built-in method numpy.core.multiarray.matmul}
   119881    0.105    0.000    0.439    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\lib\twodim_base.py:197(diag)
   119881    0.083    0.000    0.256    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\fromnumeric.py:1294(diagonal)
   239762    0.049    0.000    0.107    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\numeric.py:504(asanyarray)
   119881    0.103    0.000    0.103    0.000 {method 'diagonal' of 'numpy.ndarray' objects}
   239763    0.058    0.000    0.058    0.000 {built-in method numpy.core.multiarray.array}
   119881    0.049    0.000    0.049    0.000 {method 'reshape' of 'numpy.ndarray' objects}
   119881    0.022    0.000    0.022    0.000 {built-in method builtins.isinstance}
   239764    0.019    0.000    0.019    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:468(inv)
        2    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\numeric.py:156(ones)
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:141(_commonType)
        2    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.empty}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\progressbar\utils.py:28(write)
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:108(_makearray)
        2    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.copyto}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.zeros}
        2    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\progressbar\bar.py:547(update)
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:126(_realType)
        1    0.000    0.000    0.000    0.000 {method 'astype' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\numeric.py:433(asarray)
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        2    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:113(isComplexType)
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:103(get_linalg_error_extobj)
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:200(_assertRankAtLeast2)
        1    0.000    0.000    0.000    0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:211(_assertNdSquareness)
        2    0.000    0.000    0.000    0.000 {built-in method time.perf_counter}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {method '__array_prepare__' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}

Upvotes: 0

Views: 2190

Answers (1)

user2699

Reputation: 3147

There are two steps I usually take before trying to speed up numpy code.

  1. Profile the code to find what is taking the most time
  2. Build a couple of test cases calling the code to verify that the optimized version still runs correctly

The test cases should be simple and quick to run, but still reflective of real data. You've provided profiling output but no test cases, so what follows is untested guesswork. Looking over the profile entries with the largest running times, it's clear that the time is spent in the loop, mostly in the matrix operations inside it.
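The second step can be as small as comparing one loop-body computation against its optimized replacement on random data. A sketch of that pattern, with hypothetical stand-ins for smx since the real matrices aren't shown:

```python
import numpy as np

def step_reference(smx, pcx1):
    # Original form: invert inside the loop body
    return np.matmul(np.linalg.inv(smx), pcx1)

def step_optimized(smxinv, pcx1):
    # Optimized form: reuse a precomputed inverse
    return np.matmul(smxinv, pcx1)

rng = np.random.default_rng(42)
nst = 12
smx = rng.standard_normal((nst, nst)) + nst * np.eye(nst)  # well-conditioned stand-in
smxinv = np.linalg.inv(smx)

# Check equivalence on a handful of random right-hand sides
for _ in range(5):
    pcx1 = rng.standard_normal((nst, 1))
    assert np.allclose(step_reference(smx, pcx1), step_optimized(smxinv, pcx1))
```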

    119881    0.049    0.000    0.049    0.000 {method 'reshape' of 'numpy.ndarray' objects}
    239763    0.058    0.000    0.058    0.000 {built-in method numpy.core.multiarray.array}
    119881    0.103    0.000    0.103    0.000 {method 'diagonal' of 'numpy.ndarray' objects}
    239762    0.049    0.000    0.107    0.000 ...\core\numeric.py:504(asanyarray)
    119881    0.083    0.000    0.256    0.000 ...\core\fromnumeric.py:1294(diagonal)
    119881    0.105    0.000    0.439    0.000 ...\lib\twodim_base.py:197(diag)
    839163    0.556    0.000    0.556    0.000 {built-in method numpy.core.multiarray.matmul}

The first oddity is that np.linalg.inv(smx) doesn't show up among the slow operations. I think you misunderstood a commenter's advice about moving it out of the main loop. Since smx doesn't change, its inverse should be computed a single time, before the loop, and reused inside it:

smxinv = np.linalg.inv(smx) ## Calculate the inverse once, outside the loop

for i in range(1,len(xg)):
    ....
    dx2 = np.matmul(smxinv, pcx1)
    ...
    ax2 = np.matmul(smxinv, px2 - np.matmul(cdx, vx2) - np.matmul(skx, dx2))
    ...

The slowest operation is matmul. That isn't too surprising: it's called seven times per iteration of the main loop, each time with different arguments, so I don't see any easy way to reduce the count. Next are diag and diagonal. Here np.diag(-1.0*smx) extracts the diagonal of the 2-D matrix smx, and since smx never changes, that extraction (and the accompanying reshape) can be hoisted out of the loop and done once:

## Extract the constant negated diagonal of smx once, outside the loop
neg_smx_diag = np.diag(-1.0*smx).reshape(nst,1)

for i in range(1,len(xg)):
    px2 = neg_smx_diag*xg[i]  ## Equivalent to the original expression
    ...

That also removes the per-iteration call to reshape. You could also precompute a few other constant subexpressions outside the loop, but this probably won't have a large effect on the overall runtime.
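One such precomputation (an untested idea, not from the answer above, shown here with hypothetical stand-in matrices and two of the three history terms): since dx2 = inv(smx) @ (px2 + na1x @ dx1 + ...), the constant products inv(smx) @ na1x etc. and the vector inv(smx) @ diag term can be folded in outside the loop, saving one matmul per iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
nst = 12
dt = 0.01
gamma, beta = 0.5, 1.0 / 6.0

# Hypothetical stand-ins for the real system matrices
smx = rng.standard_normal((nst, nst)) + nst * np.eye(nst)
cdx = rng.standard_normal((nst, nst))
na1x = (1.0 / beta / dt**2) * smx + (gamma / beta / dt) * cdx
na2x = (1.0 / beta / dt) * smx + (gamma / beta - 1.0) * cdx

smxinv = np.linalg.inv(smx)
neg_diag = np.diag(-1.0 * smx).reshape(nst, 1)

# Fold the inverse into the constant matrices once, outside the loop
b1 = np.matmul(smxinv, na1x)
b2 = np.matmul(smxinv, na2x)
c = np.matmul(smxinv, neg_diag)  # constant column vector

dx1 = rng.standard_normal((nst, 1))
vx1 = rng.standard_normal((nst, 1))
xg_i = 0.37

# Original form: build pcx1, then multiply by the inverse (4 matmuls with 3 terms)
px2 = neg_diag * xg_i
pcx1 = px2 + np.matmul(na1x, dx1) + np.matmul(na2x, vx1)
dx2_orig = np.matmul(smxinv, pcx1)

# Folded form: one fewer matmul per iteration
dx2_fold = c * xg_i + np.matmul(b1, dx1) + np.matmul(b2, vx1)

assert np.allclose(dx2_orig, dx2_fold)
```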

Each of these steps needs to be run against a test case to make sure it doesn't change the function's behaviour, then profiled to see how much improvement (if any) it provides. While I'd expect you can get the runtime under four or five seconds, Python won't be able to match the performance of the compiled languages.

Upvotes: 1
