Why is numpy slower than for loop

Question

Update: this feature is now in sciPy.stats.qmc.discrepancy and was ported to Cython and also parallelised.

I have a function using some for loops and I wanted to improve the speed using numpy. But this seems not to do the trick as the bumpy version appears to be 2 times slower. Here is the code:

import numpy as np
import itertools
import timeit

def func():
    sample = np.random.random_sample((100, 2))
    
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for k in range(dim):
            sub = np.abs(sample[i, k] - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
    
        disc1 += prod

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * np.abs(sample[i, k] - 0.5)
            b = 0.5 * np.abs(sample[j, k] - 0.5)
            c = 0.5 * np.abs(sample[i, k] - sample[j, k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


def func_numpy():
    sample = np.random.random_sample((100, 2))

    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))
    
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    
    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


print('Normal function time: ' , timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))

The timing output is:

Normal function time:  [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time:  [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]

What am I missing here? I know that the bottleneck is the itertools part because I have a 100x100x2 loop instead of a 100x2 loop before. Do you see another way to do that?

Divakar · Accepted Answer

With NumPy, one must look to vectorize things and we could certainly do so here.

Taking a closer look at the loop portion, we are iterating along the first axis of the input data samples twice with that loop startup :

for i, j in itertools.product(range(n_sample), range(n_sample)):

We could convert these iterations into vectorized operations, once we let broadcasting handle those.

Now, to have a fully vectorized solution, we would need a lot more memory space, specifically (N,N,M), where (N,M) is the shape of the input data.

Another noticeable aspect here is that at each iteration, we aren't doing a lot of work, as we are performing operation on each row and each row contains only 2 elements for the given sample. So, the idea that comes out is that we could run a loop along M, such that at each iteration, we would compute the prod and accumulate. Thus, for the given sample, its just two loop iterations.

Getting out of the loop, we would have the accumulated prod, that just needs summation for disc2 as the final output.

Here's an implementation to fulfil the above mentioned ideas -

prod_arr = 1
for i in range(sample.shape[1]):
    si = sample[:,i]
    prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                                    0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()

Runtime test

The stripped down version of the loopy portion from the original approach and the modified versions as approaches are listed below :

def org_app(sample):
    disc2 = 0
    n_sample = len(sample)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
            np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    return disc2


def mod_app(sample):
    prod_arr = 1
    for i in range(sample.shape[1]):
        si = sample[:,i]
        prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                                        0.5 * np.abs(si[:,None] - si)
    disc2 = prod_arr.sum()
    return disc2

Timings and verification -

In [10]: sample = np.random.random_sample((100, 2))

In [11]: org_app(sample)
Out[11]: 11934.878683659041

In [12]: mod_app(sample)
Out[12]: 11934.878683659068

In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop

In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop

About 900x speedup! Well this should be motivating enough, hopefully to look to vectorize things whenever possible.

Why is numpy slower than for loop

Answers (2)

Related Questions