GrigorisG
GrigorisG

Reputation: 978

Time-efficient combination of numpy functions

I have two 3 dimensional numpy arrays A, B (size ~ (1000, 1000, 3) -> image processing) and the element-wise functions on it.

The functions are sequentially:

import numpy as np
A  = A ** 3
A = np.maximum(A, 0.001)
C = np.divide(B, A)

Since the function operating these 3 commands is the bottleneck of a time-demanding process, I would like to ask whether there is a way to perform all of those with a single access to each element in memory, i.e. the fastest performance.

The only combinations I could find was the divide part, e.g. here or here, or this one which is a special case because of the einstein sum.

Is there any way to do that accessing each element in the memory one time (thus make it time-efficient) without the need to write a custom ufunc?

Upvotes: 1

Views: 183

Answers (2)

Divakar
Divakar

Reputation: 221774

Frankly, numexpr one-liner listed in @ali_m's solution looks like the way to go, given the associated speedup numbers. Keeping things in NumPy, listed in this post is an alternative suggestion.

Let's time those instructions one at a time and see if there's any bottleneck -

In [108]: # Random input arrays
     ...: A = np.random.rand(1000,1000,3)
     ...: B = np.random.rand(1000,1000,3)
     ...: 

In [109]: %timeit A**3
1 loops, best of 3: 442 ms per loop

In [110]: A  = A ** 3

In [111]: %timeit np.maximum(A, 0.001)
100 loops, best of 3: 16.4 ms per loop

In [112]: A = np.maximum(A, 0.001)

In [113]: %timeit np.divide(B, A)
10 loops, best of 3: 19.7 ms per loop

So, it seems the power calculation takes up a huge portion of the total runtime.

Let's introduce np.einsum there, but please be aware that of the datatypes involved.

In [114]: # Random input arrays
     ...: A = np.random.rand(1000,1000,3)
     ...: B = np.random.rand(1000,1000,3)
     ...: 

In [115]: %timeit A**3
1 loops, best of 3: 442 ms per loop

In [116]: %timeit np.einsum('ijk,ijk,ijk->ijk',A,A,A)
10 loops, best of 3: 28.3 ms per loop

In [117]: np.allclose(A**3,np.einsum('ijk,ijk,ijk->ijk',A,A,A))
Out[117]: True

That's a good speedup there.

Upvotes: 2

ali_m
ali_m

Reputation: 74262

Is there any way to do that accessing each element in the memory one time (thus make it time-efficient) without the need to write a custom ufunc?

Yes, this is exactly what numexpr was designed for.

import numpy as np
import numexpr as ne

def func1(A, B):
    A = A ** 3
    A = np.maximum(A, 0.001)
    return np.divide(B, A)

def func2(A, B):
    return ne.evaluate("B / where(A**3 > 0.001, A**3, 0.001)",
                       local_dict={'A':A,'B':B})

A, B = np.random.randn(2, 1000, 1000, 3)

print(np.allclose(func1(A, B), func2(A, B)))
# True

numexpr gives about a factor of 70 improvement over your original code:

In [1]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func1(A, B)
   ....: 
1 loop, best of 3: 837 ms per loop

In [2]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func2(A, B)
   ....: 
The slowest run took 8.87 times longer than the fastest. This could mean that an
intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop

In part this is because numexpr will use multiple threads for the computation by default, but even with a single thread it still crushes naive vectorization:

In [3]: ne.set_num_threads(1)
Out[3]: 8

In [4]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func2(A, B)
   ....: 
10 loops, best of 3: 47.3 ms per loop

Upvotes: 3

Related Questions