Doing for loop computations faster

Question

I have a for loop that performs an operation on each element of an array. The array has 1e5 elements:

import numpy as np
A=np.array([1,2,3,4..........100000])
for i in range(0,len(A)):
    A[i]=(A[i]*2+A[i]*4)**(1/3)

I want to parallelise the code above so that each iteration of the for loop runs on a different core, making the code run faster. I have a workstation with 48 cores. How can I achieve this kind of parallel processing in Python? Please help.

ShadowRanger · Accepted Answer

Don't bother parallelizing just yet. Right now, you're taking no advantage of numpy vectorization; you may as well be using a Python list (or maybe array.array) for all the benefit numpy is giving you.

Actually use the vectorization features, and the overhead should drop by several orders of magnitude:

import numpy as np
A = np.array([1,2,3,4..........100000])  # If this is actually the values you want, use np.arange(1, 100000+1) to speed it up
A = (A * 6) ** (1 / 3)

# If the result should truncate back to int64, not convert to doubles, cast back at the end
A = A.astype(np.int64)

(A * 6) ** (1 / 3) does the same work as the for loop did, but much faster. You could match the original code more closely with A = (A * 2 + A * 4) ** (1/3), but multiplying by 2 and by 4 separately and adding the results is pointless when you could just multiply by 6 directly. The final line (optional, depending on intent) reproduces the behavior of the original loop: that loop assigned each floating-point result back into an integer array, which truncated it, so the cast truncates back to the original integer dtype.
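As a quick sanity check of the claim that the loop and the vectorized expression compute the same values, here is a minimal sketch. It assumes a float64 array (not the integer array from the question) so that no truncation happens on assignment, and it uses a small, arbitrarily chosen size n to keep the loop fast; the variable names are hypothetical.

```python
import numpy as np

n = 1000  # hypothetical small size, chosen only to keep the check fast

# Assumption: float64 dtype, so assigning into the array does not truncate
loop_result = np.arange(1, n + 1, dtype=np.float64)
for i in range(len(loop_result)):
    # Same per-element formula as the loop in the question
    loop_result[i] = (loop_result[i] * 2 + loop_result[i] * 4) ** (1 / 3)

# Vectorized form from the answer, applied to the whole array at once
vec_input = np.arange(1, n + 1, dtype=np.float64)
vec_result = (vec_input * 6) ** (1 / 3)

# Compare with a tolerance rather than exact equality, since the scalar
# and array code paths may round the last bit differently
print(np.allclose(loop_result, vec_result))
```

A tolerance-based comparison is used deliberately: results that land very close to a whole number are the ones where truncating to an integer dtype is most sensitive to last-bit rounding.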

Comparing performance with IPython's %%timeit magic as a microbenchmark:

In [2]: %%timeit
   ...: A = np.arange(1, 100000+1)
   ...: for i in range(len(A)):
   ...:     A[i] = (A[i]*2 + A[i]*4) ** (1/3)
   ...:
427 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %%timeit
   ...: A = np.arange(1, 100000+1)
   ...: A = (A * 6) ** (1/3)
   ...:
2.72 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The vectorized code takes about 0.6% of the time taken by the naive loop; merely parallelizing the naive loop would never come close to achieving that sort of speedup. Adding the .astype(np.int64) cast only increases runtime by about 6%, still a trivial fraction of what the original for loop required.
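For readers not working in IPython, a similar comparison could be sketched with the standard-library timeit module. This is only an illustration: the function names and the smaller array size N are assumptions made here to keep the run short, and the absolute timings will differ by machine.

```python
import timeit
import numpy as np

N = 10_000  # assumption: smaller than the question's 100000 so the loop finishes quickly

def naive_loop():
    # Element-by-element loop, as in the question
    A = np.arange(1, N + 1)
    for i in range(len(A)):
        A[i] = (A[i] * 2 + A[i] * 4) ** (1 / 3)
    return A

def vectorized():
    # Whole-array expression from the answer, cast back to int64
    A = np.arange(1, N + 1)
    return ((A * 6) ** (1 / 3)).astype(np.int64)

# Take the best of a few single-run timings for each version
loop_time = min(timeit.repeat(naive_loop, number=1, repeat=3))
vec_time = min(timeit.repeat(vectorized, number=1, repeat=3))
print(f"loop: {loop_time * 1e3:.2f} ms, vectorized: {vec_time * 1e3:.2f} ms")
```

Taking the minimum of several repeats is one common way to reduce noise from other processes; %%timeit reports a mean and standard deviation instead, so the numbers are not directly comparable to those above.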
