Reputation: 720
My code contains a for loop with a large number of iterations. Within the loop I need to do a matrix multiplication and a matrix inverse (normally on a matrix of size 12 x 12). The loop needs to run 120,000 times, and it currently takes about 14 s, which is very slow compared to MATLAB (1 s) and FORTRAN (0.4 s). Below is the function I am trying to optimize:
import numpy as np
from numpy import linalg

def fixed_simulator(ref, xg, yg, dt, ndiv, ijk, nst, smx, skx, cdx, smy, sky, cdy):
    # Newmark-beta parameters (linear acceleration scheme)
    gamma = 0.5
    beta = 1.0/6.0
    # Effective stiffness matrix
    knx = skx + (gamma/beta/dt)*cdx + (1.0/beta/np.power(dt,2))*smx
    # Initial conditions (zero displacement and velocity)
    dx1 = np.ones((nst,1), dtype=float)*0.0
    vx1 = np.ones((nst,1), dtype=float)*0.0
    px1 = np.diag(-1.0*smx).reshape(nst,1)*xg[0]
    ax1 = np.matmul(linalg.inv(smx), px1 - np.matmul(cdx, vx1) - np.matmul(skx, dx1))
    # I = np.ones((nst,1), dtype=float)
    dx2 = np.zeros((nst,1), dtype=float)
    vx2 = np.zeros((nst,1), dtype=float)
    px2 = np.zeros((nst,1), dtype=float)
    ax2 = np.zeros((nst,1), dtype=float)
    # Constant coefficient matrices for the effective load
    na1x = (1.0/beta/np.power(dt,2))*smx + (gamma/beta/dt)*cdx
    na2x = (1.0/beta/dt)*smx + (gamma/beta - 1.0)*cdx
    na3x = (1.0/2.0/beta - 1.0)*smx + (gamma*dt/2.0/beta - dt)*cdx
    print(len(xg))
    # -----> Below is the loop that's taking a long time.
    for i in range(1, len(xg)):
        px2 = np.diag(-1.0*smx).reshape(nst,1)*xg[i]
        pcx1 = px2 + np.matmul(na1x, dx1) + np.matmul(na2x, vx1) + np.matmul(na3x, ax1)
        dx2 = np.matmul(np.linalg.inv(smx), pcx1)
        vx2 = (gamma/beta/dt)*(dx2 - dx1) + (1.0 - gamma/beta)*vx1 + dt*(1.0 - gamma/2.0/beta)*ax1
        ax2 = np.matmul(np.linalg.inv(smx), px2 - np.matmul(cdx, vx2) - np.matmul(skx, dx2))
        dx1, vx1, px1, ax1 = dx2, vx2, px2, ax2
Most of the time seems to be going into the matrix inverse and multiplication calls.
NumPy configuration on the system:
blas_mkl_info:
    NOT AVAILABLE
blis_info:
    NOT AVAILABLE
openblas_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
    NOT AVAILABLE
openblas_lapack_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = ['C:\\projects\\numpy-wheels\\numpy\\build\\openblas']
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
cProfile Result
2157895 function calls in 2.519 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.474 1.474 2.519 2.519 C:\Users\Naseef\OneDrive\04AllPhDPrograms\mhps\mhps\fixed.py:154(fixed_simulator)
839163 0.556 0.000 0.556 0.000 {built-in method numpy.core.multiarray.matmul}
119881 0.105 0.000 0.439 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\lib\twodim_base.py:197(diag)
119881 0.083 0.000 0.256 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\fromnumeric.py:1294(diagonal)
239762 0.049 0.000 0.107 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\numeric.py:504(asanyarray)
119881 0.103 0.000 0.103 0.000 {method 'diagonal' of 'numpy.ndarray' objects}
239763 0.058 0.000 0.058 0.000 {built-in method numpy.core.multiarray.array}
119881 0.049 0.000 0.049 0.000 {method 'reshape' of 'numpy.ndarray' objects}
119881 0.022 0.000 0.022 0.000 {built-in method builtins.isinstance}
239764 0.019 0.000 0.019 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:468(inv)
2 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\numeric.py:156(ones)
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:141(_commonType)
2 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
2 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\progressbar\utils.py:28(write)
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:108(_makearray)
2 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.copyto}
4 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}
2 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\progressbar\bar.py:547(update)
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:126(_realType)
1 0.000 0.000 0.000 0.000 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\core\numeric.py:433(asarray)
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
2 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:113(isComplexType)
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:103(get_linalg_error_extobj)
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:200(_assertRankAtLeast2)
1 0.000 0.000 0.000 0.000 C:\Users\Naseef\AppData\Local\Programs\Python\Python36-32\lib\site-packages\numpy\linalg\linalg.py:211(_assertNdSquareness)
2 0.000 0.000 0.000 0.000 {built-in method time.perf_counter}
3 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method '__array_prepare__' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
Upvotes: 0
Views: 2190
Reputation: 3147
There are two steps I usually take before trying to speed up numpy code: profile it to find where the time is actually going, and put together test cases so any change can be checked for correctness. The test cases should be simple and quick to run, but still reflective of real data. You've provided profiling but no test cases, so what follows is going to be some untested guesswork. Looking over the profile entries with the largest running time, it's clear that the time is spent in the loop, and mostly in the matrix operations inside it:
119881 0.049 0.000 0.049 0.000 {method 'reshape' of 'numpy.ndarray' objects}
239763 0.058 0.000 0.058 0.000 {built-in method numpy.core.multiarray.array}
119881 0.103 0.000 0.103 0.000 {method 'diagonal' of 'numpy.ndarray' objects}
239762 0.049 0.000 0.107 0.000 ...\core\numeric.py:504(asanyarray)
119881 0.083 0.000 0.256 0.000 ...\core\fromnumeric.py:1294(diagonal)
119881 0.105 0.000 0.439 0.000 ...\lib\twodim_base.py:197(diag)
839163 0.556 0.000 0.556 0.000 {built-in method numpy.core.multiarray.matmul}
The first oddity is that np.linalg.inv(smx) doesn't show up among the slow operations; I think you misunderstood a commenter's advice and the profiled run had it pulled out of the main loop entirely. The result of the inverse should still be used in the main loop, but since smx never changes, the inverse itself only needs to be calculated a single time:
smxinv = np.linalg.inv(smx)   ## Calculate the inverse a single time, outside the loop
for i in range(1, len(xg)):
    ...
    dx2 = np.matmul(smxinv, pcx1)
    ...
    ax2 = np.matmul(smxinv, px2 - np.matmul(cdx, vx2) - np.matmul(skx, dx2))
    ...
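A side note of my own (not something in the question's code, and for a 12 x 12 system the difference will be small): if SciPy is available, the explicit inverse can instead be replaced by a one-time LU factorisation, which is the usual way to solve the same system repeatedly:

from scipy.linalg import lu_factor, lu_solve

smx_lu = lu_factor(smx)   ## Factorise smx once, outside the loop
for i in range(1, len(xg)):
    ...
    dx2 = lu_solve(smx_lu, pcx1)
    ...
    ax2 = lu_solve(smx_lu, px2 - np.matmul(cdx, vx2) - np.matmul(skx, dx2))
    ...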
The slowest operation is matmul. That isn't too surprising: it's called seven times in the main loop, and each call appears to have unique arguments, so I don't see any easy way to speed those calls up individually.
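The one consolidation I can think of (a sketch of my own, and at 12 x 12 the stacking overhead may eat most of the gain) relies on na1x, na2x and na3x being constant, so the three products feeding pcx1 can be merged into a single matmul:

nax = np.hstack([na1x, na2x, na3x])     ## Constant (nst, 3*nst) block matrix, built once
for i in range(1, len(xg)):
    ...
    state = np.vstack([dx1, vx1, ax1])  ## Current state stacked into a (3*nst, 1) vector
    pcx1 = px2 + np.matmul(nax, state)  ## Same result as the three separate matmuls
    ...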
Next come diag and diagonal. Here np.diag(-1.0*smx) just pulls the (negated) diagonal out of smx, and smx never changes between iterations, so the extraction and the reshape can be hoisted outside the loop, leaving only the scaling by xg[i] inside:
## Extract the diagonal once, outside the loop (may not have a large effect)
smx_diag = np.diag(-1.0*smx).reshape(nst, 1)
for i in range(1, len(xg)):
    px2 = smx_diag*xg[i]   ## Equivalent to the original np.diag(...) expression
    ...
That also removes the per-iteration call to reshape. You could also precompute some constants and avoid a few calculations in each iteration, but this probably won't have a large effect on the overall runtime.
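For instance, the scalar coefficients in the velocity update never change between iterations (the names below are just placeholders for the factors already written out in the loop):

c1 = gamma/beta/dt                     ## Hoisted scalar coefficients
c2 = 1.0 - gamma/beta
c3 = dt*(1.0 - gamma/2.0/beta)
for i in range(1, len(xg)):
    ...
    vx2 = c1*(dx2 - dx1) + c2*vx1 + c3*ax1
    ...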
Each of these steps needs to be run against a test case to make sure it doesn't change the function's behaviour, then profiled to see how much (if any) improvement it provides. While I'd expect you can get the runtime under four or five seconds, Python won't be able to match the performance of compiled languages.
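As an illustration of the kind of check I mean, here's a minimal sketch; the random 12 x 12 matrix and short ground-motion record are stand-ins (an assumption), so your real mass/damping/stiffness data should be substituted:

import numpy as np

rng = np.random.default_rng(0)
nst = 12                                                    # matches the 12 x 12 system in the question
smx = rng.standard_normal((nst, nst)) + 10.0*np.eye(nst)    # stand-in matrix, well conditioned
xg = rng.standard_normal(1000)                              # stand-in ground-motion record

smxinv = np.linalg.inv(smx)                   # hoisted inverse
smx_diag = np.diag(-1.0*smx).reshape(nst, 1)  # hoisted diagonal

for i in range(1, len(xg)):
    # The question's pattern: rebuild diag and inv in every iteration.
    px2_ref = np.diag(-1.0*smx).reshape(nst, 1)*xg[i]
    sol_ref = np.matmul(np.linalg.inv(smx), px2_ref)
    # The hoisted pattern suggested above.
    px2_opt = smx_diag*xg[i]
    sol_opt = np.matmul(smxinv, px2_opt)
    assert np.allclose(px2_ref, px2_opt) and np.allclose(sol_ref, sol_opt)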
Upvotes: 1