Balfar

Reputation: 135

What is the most efficient way to deal with a loop on NumPy arrays?

The question is simple: here is my current algorithm. It is terribly slow because of the loops over the arrays. Is there a way to rewrite it to avoid the loops and take advantage of NumPy array operations?

import numpy as np

def loopingFunction(listOfVector1, listOfVector2):
    resultArray = []

    for vector1 in listOfVector1:
        result = 0

        for vector2 in listOfVector2:
            result += np.dot(vector1, vector2) * vector2[2]

        resultArray.append(result)

    return np.array(resultArray)

listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)

listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))

listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)

listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))

result = loopingFunction(listOfVector1, listOfVector2)

I am supposed to deal with really big arrays that have way more than 1000 vectors each. So if you have any advice, I'll take it.

Upvotes: 3

Views: 1039

Answers (4)

Mechanic Pig

Reputation: 7736

Avoid the nested loops and adjust the calculation order. This is 20 times faster than the optimized np.einsum and nearly 400_000 times faster than the original program:

>>> out = listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2))
>>> np.allclose(out, loopingFunction(listOfVector1, listOfVector2))
True
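The reordering is valid by linearity: the original program computes, for each i, the sum over j of (v1_i · v2_j) * v2_j[2], and pulling v1_i out of the sum gives v1_i · (Σ_j v2_j[2] * v2_j), so the inner sum is a single 3-vector computed once instead of once per i. A minimal check of the identity on small random arrays (names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))   # stands in for listOfVector1
B = rng.random((4, 3))   # stands in for listOfVector2

# Naive double loop: for each row a of A, sum_j (a . B[j]) * B[j, 2]
naive = np.array([sum(np.dot(a, b) * b[2] for b in B) for a in A])

# Reassociated: precompute s = sum_j B[j, 2] * B[j] (a single 3-vector),
# then each result is just a . s
s = B[:, 2].dot(B)   # shape (3,)
fast = A.dot(s)      # shape (5,)

assert np.allclose(naive, fast)
```

This turns an O(n·m) sequence of dot products into one O(m) reduction followed by one O(n) matrix-vector product.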

Test:

>>> timeit(lambda: loopingFunction(listOfVector1, listOfVector2), number=1)
1.4389081999834161
>>> timeit(lambda: listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2)), number=400_000)
1.3162514999858104
>>> timeit(lambda: np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:, 2], optimize=['einsum_path', (1, 2), (0, 1)]), number=18_000)
1.3501517999975476

Upvotes: 3

Jérôme Richard

Reputation: 50358

Just for fun, I wrote an optimized Numba implementation that outperforms all the others. It is based on the einsum optimization in @MichaelSzczesny's answer.

import numpy as np
import numba as nb

# This decorator asks Numba to eagerly compile the code using
# the provided signature string (containing the parameter types).
@nb.njit('(float64[:,::1], float64[:,::1])')
def loopingFunction_numba(listOfVector1, listOfVector2):
    n, m = listOfVector1.shape
    assert m == 3

    result = np.empty(n)
    s1 = s2 = s3 = 0.0

    for i in range(n):
        factor = listOfVector2[i, 2]
        s1 += listOfVector2[i, 0] * factor
        s2 += listOfVector2[i, 1] * factor
        s3 += listOfVector2[i, 2] * factor

    for i in range(n):
        result[i] = listOfVector1[i, 0] * s1 + listOfVector1[i, 1] * s2 + listOfVector1[i, 2] * s3

    return result

result = loopingFunction_numba(listOfVector1, listOfVector2)

Here are timings on my i5-9600KF processor:

Initial:          1052.0 ms
ymmx:                5.121 ms
MichaelSzczesny:        75.40 us
MechanicPig:             3.36 us
Numba:                   2.74 us
Optimal lower bound:     0.66 us

This solution is ~384_000 times faster than the original one. Note that it does not even use the SIMD instructions of the processor, which would yield a further ~4x speed-up on my machine. That is only possible with transposed inputs, which are much more SIMD-friendly than the current layout. Transposition may also speed up other answers, like MechanicPig's, since BLAS can often benefit from it. The resulting code would reach the symbolic 1_000_000x speed-up factor!
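For illustration, here is a sketch of what the transposed (structure-of-arrays) layout could look like in plain NumPy: one contiguous array per component instead of an (n, 3) array of rows. The variable names are illustrative, not from the answers above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Structure-of-arrays layout: each component lives in its own
# contiguous 1-D array (SIMD-friendly reductions).
x1, y1, z1 = rng.random(n), rng.random(n), rng.random(n)
x2, y2, z2 = rng.random(n), rng.random(n), rng.random(n)

# The three inner sums over listOfVector2 become plain dot products
# of contiguous arrays.
s1 = np.dot(x2, z2)
s2 = np.dot(y2, z2)
s3 = np.dot(z2, z2)

result = x1 * s1 + y1 * s2 + z1 * s3

# Same value as the array-of-vectors formulation
v1 = np.column_stack((x1, y1, z1))
v2 = np.column_stack((x2, y2, z2))
assert np.allclose(result, v1.dot(v2[:, 2].dot(v2)))
```

Each reduction now streams through one contiguous buffer, which is the memory-access pattern vectorizing compilers and BLAS kernels prefer.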

Upvotes: 6

Michael Szczesny

Reputation: 5036

The obligatory np.einsum benchmark

r2 = np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:,2], optimize=['einsum_path', (1, 2), (0, 1)])
#%timeit result: 10000 loops, best of 5: 116 µs per loop

np.testing.assert_allclose(result, r2)
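The hard-coded optimize path above does not have to be written by hand: np.einsum_path computes the cheapest contraction order for a given subscript string and set of operands, and the returned path can be passed back to np.einsum. A small sketch with stand-in arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
v1 = rng.random((1000, 3))
v2 = rng.random((1000, 3))

# Ask NumPy for the optimal contraction order instead of hard-coding it.
path, info = np.einsum_path('ij, kj, k->i', v1, v2, v2[:, 2],
                            optimize='optimal')

# Reuse the precomputed path on every subsequent call.
out = np.einsum('ij, kj, k->i', v1, v2, v2[:, 2], optimize=path)
```

The `info` string also reports the FLOP counts of the naive and optimized orders, which makes it easy to see why contracting (1, 2) first is so much cheaper.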

Upvotes: 8

ymmx

Reputation: 4967

You can at least remove the two for loops to save a lot of time by using matrix computation directly:

import time

import numpy as np

def loopingFunction(listOfVector1, listOfVector2):
    resultArray = []

    for vector1 in listOfVector1:
        result = 0

        for vector2 in listOfVector2:
            result += np.dot(vector1, vector2) * vector2[2]

        resultArray.append(result)

    return np.array(resultArray)

def loopingFunction2(listOfVector1, listOfVector2):
    resultArray = np.sum(np.dot(listOfVector1, listOfVector2.T) * listOfVector2[:,2], axis=1)

    return resultArray

listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)

listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))

listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)

listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
t0 = time.time()
result = loopingFunction(listOfVector1, listOfVector2)
print('time old version',time.time() - t0)
t0 = time.time()
result2 = loopingFunction2(listOfVector1, listOfVector2)
print('time matrix computation version',time.time() - t0)
print('Are the results the same', np.allclose(result, result2))

Which gives

time old version 1.174513578414917
time matrix computation version 0.011968612670898438
Are the results the same True

Basically, the fewer loops, the better.

Upvotes: 4
