How can I speed up the function least_squares? We have six variables (3 orientation angles and 3 translation components) that need to be optimized. The function takes two sets of 3D points, two sets of 2D image-plane points and a projection matrix as input.
dSeed = np.zeros(6)
optRes = least_squares(minReproj, dSeed, method='lm', max_nfev=600,
                       args=(points_prev, points_cur, d3d_prev, d3d_cur, Proj1))
This function minimizes the forward-backward reprojection error of the points.
def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
    Rmat = genEulerZXZ(dof[0], dof[1], dof[2])  # my function
    translationArray = np.array([[dof[3]], [dof[4]], [dof[5]]])
    temp = np.hstack((Rmat, translationArray))
    perspectiveProj = np.vstack((temp, [0, 0, 0, 1]))
    numPoints = d2dPoints1.shape[0]
    errorA = np.zeros((numPoints, 3))
    errorB = np.zeros((numPoints, 3))
    forwardProj = np.matmul(w2cMatrix, perspectiveProj)
    backwardProj = np.matmul(w2cMatrix, np.linalg.inv(perspectiveProj))
    for i in range(numPoints):
        Ja = np.ones((3))
        Jb = np.ones((3))
        Wa = np.ones((4))
        Wb = np.ones((4))
        Ja[0:2] = d2dPoints1[i, :]
        Jb[0:2] = d2dPoints2[i, :]
        Wa[0:3] = d3dPoints1[i, :]
        Wb[0:3] = d3dPoints2[i, :]
        JaPred = np.matmul(forwardProj, Wb)
        JaPred /= JaPred[-1]
        e1 = Ja - JaPred
        JbPred = np.matmul(backwardProj, Wa)
        JbPred /= JbPred[-1]
        e2 = Jb - JbPred
        errorA[i, :] = e1
        errorB[i, :] = e2
    residual = np.vstack((errorA, errorB))
    return residual.flatten()
def genEulerZXZ(psi, theta, sigma):
    c1 = cos(psi)
    s1 = sin(psi)
    c2 = cos(theta)
    s2 = sin(theta)
    c3 = cos(sigma)
    s3 = sin(sigma)
    mat = np.zeros((3, 3))
    mat[0, 0] = (c1 * c3) - (s1 * c2 * s3)
    mat[0, 1] = (-c1 * s3) - (s1 * c2 * c3)
    mat[0, 2] = (s1 * s2)
    mat[1, 0] = (s1 * c3) + (c1 * c2 * s3)
    mat[1, 1] = (-s1 * s3) + (c1 * c2 * c3)
    mat[1, 2] = (-c1 * s2)
    mat[2, 0] = (s2 * s3)
    mat[2, 1] = (s2 * c3)
    mat[2, 2] = c2
    return mat
This optimization takes 0.2 to 0.4 seconds, which is too slow. Maybe you know how to speed up this process? Or maybe there is another way to find the relative rotation and translation between two point sets? Profiling output, for rpoleski:
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     96    0.023    0.000   19.406    0.202  /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:240(least_squares)
   4548    0.132    0.000   18.986    0.004  /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:801(fun_wrapped)
     96    0.012    0.000   18.797    0.196  /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:42(call_minpack)
   4548   11.102    0.002   18.789    0.004  /home/pi/helperFunctions.py:29(minimizeReprojection)
Upvotes: 0
Views: 2044
Reputation: 26886
Most likely, during scipy.optimize.least_squares() the largest portion of the time is spent computing the residuals, which in your case means running the code in minReproj().
However, the code you provided in minReproj() has sub-optimal memory management, which could be greatly improved with pre-allocation. This would result in significant speed gains.
For example, vstack() and hstack() pay a significant penalty for having to copy their arguments into the memory of the final result. Consider the first few lines of minReproj(), which I regrouped in gen_affine_OP(). These can be rewritten as gen_affine() with much better timings (I have also rewritten gen_euler_zxz() so that it does not allocate new memory):
import numpy as np
from math import sin, cos

def gen_euler_zxz(result, psi, theta, sigma):
    c1 = cos(psi)
    s1 = sin(psi)
    c2 = cos(theta)
    s2 = sin(theta)
    c3 = cos(sigma)
    s3 = sin(sigma)
    result[0, 0] = (c1 * c3) - (s1 * c2 * s3)
    result[0, 1] = (-c1 * s3) - (s1 * c2 * c3)
    result[0, 2] = (s1 * s2)
    result[1, 0] = (s1 * c3) + (c1 * c2 * s3)
    result[1, 1] = (-s1 * s3) + (c1 * c2 * c3)
    result[1, 2] = (-c1 * s2)
    result[2, 0] = (s2 * s3)
    result[2, 1] = (s2 * c3)
    result[2, 2] = c2
    return result

def gen_affine(dof):
    result = np.zeros((4, 4))
    gen_euler_zxz(result[:3, :3], dof[0], dof[1], dof[2])
    result[:3, 3] = dof[3:]
    result[3, 3] = 1
    return result

def gen_affine_OP(dof):
    Rmat = gen_euler_zxz(np.empty((3, 3)), dof[0], dof[1], dof[2])
    translationArray = np.array([[dof[3]], [dof[4]], [dof[5]]])
    temp = np.hstack((Rmat, translationArray))
    return np.vstack((temp, [0, 0, 0, 1]))

dof = 1, 2, 3, 4, 5, 6

%timeit gen_affine_OP(dof)
# 100000 loops, best of 3: 16.6 µs per loop
%timeit gen_affine(dof)
# 100000 loops, best of 3: 5.02 µs per loop
Similarly, the residual = np.vstack((errorA, errorB)) call can be avoided by defining a larger residual upfront and giving errorA and errorB views into it.
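A minimal sketch of that idea (names mirror the loop above, the numbers are illustrative): errorA and errorB become slices of one pre-allocated residual buffer, so writes to them land directly in residual and the final vstack() disappears.

```python
import numpy as np

numPoints = 4
residual = np.empty((2 * numPoints, 3))  # single pre-allocated buffer
errorA = residual[:numPoints]            # view into the first half, no copy
errorB = residual[numPoints:]            # view into the second half, no copy

errorA[:] = 1.0  # writes go straight into residual
errorB[:] = 2.0
# no vstack() needed: residual already contains both halves
print(residual.ravel())
```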
Another example is the creation of Ja, Jb, Wa and Wb inside the loop:
def func_OP(numPoints):
    for i in range(numPoints):
        Ja = np.ones((3))
        Jb = np.ones((3))
        Wa = np.ones((4))
        Wb = np.ones((4))

def func(n):
    Ja = np.empty(3)
    Jb = np.empty(3)
    Wa = np.empty(3)
    Wb = np.empty(3)
    for i in range(n):
        Ja[:] = 1
        Jb[:] = 1
        Wa[:] = 1
        Wb[:] = 1

%timeit func_OP(1000)
# 100 loops, best of 3: 9.31 ms per loop
%timeit func(1000)
# 100 loops, best of 3: 2.2 ms per loop
Also, .flatten() makes a copy you do not need, while .ravel() just returns a view. That is enough for your needs and is much faster:
a = np.ones((300, 300))
%timeit a.ravel()
# 1000000 loops, best of 3: 215 ns per loop
%timeit a.flatten()
# 10000 loops, best of 3: 113 µs per loop
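The view-versus-copy difference is easy to verify (a small check; for a contiguous array like this one, .ravel() shares the buffer of the original while .flatten() always copies):

```python
import numpy as np

a = np.ones((300, 300))
r = a.ravel()    # view: shares memory with a (a is contiguous)
f = a.flatten()  # copy: independent buffer

a[0, 0] = 42.0
print(r[0], f[0])  # the view sees the change, the copy does not
```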
The final remark concerns the speeding up of your main loop. I do not see a simple vectorized rewriting for it, but you can speed things up with Numba (as long as it compiles in non-object mode).
To do so, you also need to JIT-decorate gen_euler_zxz(), and you need to replace the np.matmul() calls with np.dot() (because np.matmul() is not supported by Numba).
Eventually, the revised minReproj() would look like:
from math import sin, cos

import numpy as np
import numba as nb

matmul = np.dot

@nb.jit
def gen_euler_zxz(result, psi, theta, sigma):
    c1 = cos(psi)
    s1 = sin(psi)
    c2 = cos(theta)
    s2 = sin(theta)
    c3 = cos(sigma)
    s3 = sin(sigma)
    result[0, 0] = (c1 * c3) - (s1 * c2 * s3)
    result[0, 1] = (-c1 * s3) - (s1 * c2 * c3)
    result[0, 2] = (s1 * s2)
    result[1, 0] = (s1 * c3) + (c1 * c2 * s3)
    result[1, 1] = (-s1 * s3) + (c1 * c2 * c3)
    result[1, 2] = (-c1 * s2)
    result[2, 0] = (s2 * s3)
    result[2, 1] = (s2 * c3)
    result[2, 2] = c2
    return result

@nb.jit
def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
    perspectiveProj = np.zeros((4, 4))
    gen_euler_zxz(perspectiveProj[:3, :3], dof[0], dof[1], dof[2])
    perspectiveProj[:3, 3] = dof[3:]
    perspectiveProj[3, 3] = 1
    numPoints = d2dPoints1.shape[0]
    residual = np.empty((2 * numPoints, 3))
    forwardProj = matmul(w2cMatrix, perspectiveProj)
    backwardProj = matmul(w2cMatrix, np.linalg.inv(perspectiveProj))
    Ja = np.empty(3)
    Jb = np.empty(3)
    Wa = np.empty(4)
    Wb = np.empty(4)
    for i in range(numPoints):
        j = i + numPoints
        Ja[2] = Jb[2] = 1.0
        Wa[3] = Wb[3] = 1.0
        Ja[0:2] = d2dPoints1[i, :]
        Jb[0:2] = d2dPoints2[i, :]
        Wa[0:3] = d3dPoints1[i, :]
        Wb[0:3] = d3dPoints2[i, :]
        JaPred = matmul(forwardProj, Wb)
        JaPred /= JaPred[-1]
        residual[i, :] = Ja - JaPred
        JbPred = matmul(backwardProj, Wa)
        JbPred /= JbPred[-1]
        residual[j, :] = Jb - JbPred
    return residual.ravel()
Testing it on some toy data shows significant speed-ups, which become dramatic when using Numba appropriately:
n = 10000
dof = 1, 2, 3, 4, 5, 6
d2dPoints1 = np.random.random((n, 2))
d2dPoints2 = np.random.random((n, 2))
d3dPoints1 = np.random.random((n, 3))
d3dPoints2 = np.random.random((n, 3))
w2cMatrix = np.random.random((3, 4))
x = minReproj_OP(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
y = minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
z = minReproj_nb(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
print(np.allclose(x, y))
# True
print(np.allclose(x, z))
# True
%timeit minReproj_OP(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
# 1 loop, best of 3: 277 ms per loop
%timeit minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
# 10 loops, best of 3: 177 ms per loop
%timeit minReproj_nb(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
# 100 loops, best of 3: 5.69 ms per loop
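For completeness, here is a runnable sketch of the pre-allocated residual (without the Numba decoration) dropped back into the original least_squares call. The toy point sets and the Proj1 matrix are illustrative stand-ins for your real camera data:

```python
import numpy as np
from math import sin, cos
from scipy.optimize import least_squares

def gen_euler_zxz(result, psi, theta, sigma):
    # ZXZ Euler rotation written into a pre-allocated 3x3 block
    c1, s1 = cos(psi), sin(psi)
    c2, s2 = cos(theta), sin(theta)
    c3, s3 = cos(sigma), sin(sigma)
    result[0, 0] = (c1 * c3) - (s1 * c2 * s3)
    result[0, 1] = (-c1 * s3) - (s1 * c2 * c3)
    result[0, 2] = (s1 * s2)
    result[1, 0] = (s1 * c3) + (c1 * c2 * s3)
    result[1, 1] = (-s1 * s3) + (c1 * c2 * c3)
    result[1, 2] = (-c1 * s2)
    result[2, 0] = (s2 * s3)
    result[2, 1] = (s2 * c3)
    result[2, 2] = c2
    return result

def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
    # pre-allocated residual: no hstack/vstack, ravel() instead of flatten()
    perspectiveProj = np.zeros((4, 4))
    gen_euler_zxz(perspectiveProj[:3, :3], dof[0], dof[1], dof[2])
    perspectiveProj[:3, 3] = dof[3:]
    perspectiveProj[3, 3] = 1
    numPoints = d2dPoints1.shape[0]
    residual = np.empty((2 * numPoints, 3))
    forwardProj = np.matmul(w2cMatrix, perspectiveProj)
    backwardProj = np.matmul(w2cMatrix, np.linalg.inv(perspectiveProj))
    Ja = np.ones(3)   # last entry stays 1 (homogeneous coordinate)
    Jb = np.ones(3)
    Wa = np.ones(4)
    Wb = np.ones(4)
    for i in range(numPoints):
        Ja[0:2] = d2dPoints1[i, :]
        Jb[0:2] = d2dPoints2[i, :]
        Wa[0:3] = d3dPoints1[i, :]
        Wb[0:3] = d3dPoints2[i, :]
        JaPred = np.matmul(forwardProj, Wb)
        residual[i, :] = Ja - JaPred / JaPred[-1]
        JbPred = np.matmul(backwardProj, Wa)
        residual[i + numPoints, :] = Jb - JbPred / JbPred[-1]
    return residual.ravel()

# toy inputs with the shapes from the question (values are illustrative)
rng = np.random.default_rng(0)
n = 50
points_prev, points_cur = rng.random((n, 2)), rng.random((n, 2))
d3d_prev, d3d_cur = rng.random((n, 3)), rng.random((n, 3))
Proj1 = rng.random((3, 4))

dSeed = np.zeros(6)
optRes = least_squares(minReproj, dSeed, method='lm', max_nfev=600,
                       args=(points_prev, points_cur, d3d_prev, d3d_cur, Proj1))
print(optRes.cost)
```

Nothing about the least_squares call itself changes; only the residual function got faster.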
Upvotes: 3