Reputation: 135
The question is simple: here is my current algorithm. This is terribly slow because of the loops on the arrays. Is there a way to change it in order to avoid the loops and take advantage of the NumPy arrays types ?
import numpy as np
def loopingFunction(listOfVector1, listOfVector2):
resultArray = []
for vector1 in listOfVector1:
result = 0
for vector2 in listOfVector2:
result += np.dot(vector1, vector2) * vector2[2]
resultArray.append(result)
return np.array(resultArray)
listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)
listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))
listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)
listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
result = loopingFunction(listOfVector1, listOfVector2)
I am supposed to deal with really big arrays, that have way more than 1000 vectors in each. So if you have any advice, I'll take it.
Upvotes: 3
Views: 1039
Reputation: 7736
Avoid nested loops and adjust the calculation order, which is 20 times faster than the optimized np.einsum
and nearly 400_000 times faster than the original program:
>>> out = listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2))
>>> np.allclose(out, loopingFunction(listOfVector1, listOfVector2))
True
Test:
>>> timeit(lambda: loopingFunction(listOfVector1, listOfVector2), number=1)
1.4389081999834161
>>> timeit(lambda: listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2)), number=400_000)
1.3162514999858104
>>> timeit(lambda: np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:, 2], optimize=['einsum_path', (1, 2), (0, 1)]), number=18_000)
1.3501517999975476
Upvotes: 3
Reputation: 50358
Just for fun, I wrote an optimized Numba implementation that outperform all others. It is based on the einsum
optimization of the @MichaelSzczesny answer.
import numpy as np
import numba as nb
# This decorator ask Numba to eagerly compile the code using
# the provided signature string (containing the parameter types).
@nb.njit('(float64[:,::1], float64[:,::1])')
def loopingFunction_numba(listOfVector1, listOfVector2):
n, m = listOfVector1.shape
assert m == 3
result = np.empty(n)
s1 = s2 = s3 = 0.0
for i in range(n):
factor = listOfVector2[i, 2]
s1 += listOfVector2[i, 0] * factor
s2 += listOfVector2[i, 1] * factor
s3 += listOfVector2[i, 2] * factor
for i in range(n):
result[i] = listOfVector1[i, 0] * s1 + listOfVector1[i, 1] * s2 + listOfVector1[i, 2] * s3
return result
result = loopingFunction_numba(listOfVector1, listOfVector2)
Here are timings on my i5-9600KF processor:
Initial: 1052.0 ms
ymmx: 5.121 ms
MichaelSzczesny: 75.40 us
MechanicPig: 3.36 us
Numba: 2.74 us
Optimal lower bound: 0.66 us
This solution is ~384_000 times faster than the original one. Note that is does not even use the SIMD instructions of the processor that would result in a ~4x speed up on my machine. This is only possible by having transposed input that are much more SIMD-friendly than the current one. Transposition may also speed up other answers like the one of MechanicPig since BLAS can often benefit from this. The resulting code would reach the symbolic 1_000_000 speed up factor!
Upvotes: 6
Reputation: 5036
The obligatory np.einsum
benchmark
r2 = np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:,2], optimize=['einsum_path', (1, 2), (0, 1)])
#%timeit result: 10000 loops, best of 5: 116 µs per loop
np.testing.assert_allclose(result, r2)
Upvotes: 8
Reputation: 4967
You can at least remove the two forloop to save alot of time, use matrix computation directly
import time
import numpy as np
def loopingFunction(listOfVector1, listOfVector2):
resultArray = []
for vector1 in listOfVector1:
result = 0
for vector2 in listOfVector2:
result += np.dot(vector1, vector2) * vector2[2]
resultArray.append(result)
return np.array(resultArray)
def loopingFunction2(listOfVector1, listOfVector2):
resultArray = np.sum(np.dot(listOfVector1, listOfVector2.T) * listOfVector2[:,2], axis=1)
return resultArray
listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)
listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))
listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)
listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
import time
t0 = time.time()
result = loopingFunction(listOfVector1, listOfVector2)
print('time old version',time.time() - t0)
t0 = time.time()
result2 = loopingFunction2(listOfVector1, listOfVector2)
print('time matrix computation version',time.time() - t0)
print('Are results are the same',np.allclose(result,result2))
Which gives
time old version 1.174513578414917
time matrix computation version 0.011968612670898438
Are results are the same True
Basically, the less loop the better.
Upvotes: 4