Reputation: 978
I have two 3-dimensional NumPy arrays A, B (shape ~ (1000, 1000, 3) -> image processing) and apply element-wise operations to them.
The operations are, sequentially:
import numpy as np
A = A ** 3
A = np.maximum(A, 0.001)
C = np.divide(B, A)
Since the function performing these 3 commands is the bottleneck of a time-demanding process, I would like to ask whether there is a way to perform all of them with a single access to each element in memory, i.e. the fastest performance.
The only combined operations I could find were for the divide part, e.g. here or here, or this one, which is a special case because of the Einstein summation.
Is there any way to do that, accessing each element in memory only once (thus making it time-efficient), without the need to write a custom ufunc?
Upvotes: 1
Views: 183
Reputation: 221774
Frankly, the numexpr one-liner listed in @ali_m's solution looks like the way to go, given the associated speedup numbers. For keeping things in NumPy, this post lists an alternative suggestion.
Let's time those instructions one at a time and see if there's any bottleneck -
In [108]: # Random input arrays
...: A = np.random.rand(1000,1000,3)
...: B = np.random.rand(1000,1000,3)
...:
In [109]: %timeit A**3
1 loops, best of 3: 442 ms per loop
In [110]: A = A ** 3
In [111]: %timeit np.maximum(A, 0.001)
100 loops, best of 3: 16.4 ms per loop
In [112]: A = np.maximum(A, 0.001)
In [113]: %timeit np.divide(B, A)
10 loops, best of 3: 19.7 ms per loop
So, it seems the power calculation takes up a huge portion of the total runtime.
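As a side note not timed in the original post, plain repeated multiplication is another NumPy-only way to sidestep the slow generic pow routine behind `A**3`:

```python
import numpy as np

A = np.random.rand(1000, 1000, 3)

# Repeated multiplication avoids the generic floating-point pow routine,
# which is what makes A**3 slow on large arrays.
cubed = A * A * A

print(np.allclose(A ** 3, cubed))  # True
```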
Let's introduce np.einsum
there, but please be aware of the datatypes involved.
In [114]: # Random input arrays
...: A = np.random.rand(1000,1000,3)
...: B = np.random.rand(1000,1000,3)
...:
In [115]: %timeit A**3
1 loops, best of 3: 442 ms per loop
In [116]: %timeit np.einsum('ijk,ijk,ijk->ijk',A,A,A)
10 loops, best of 3: 28.3 ms per loop
In [117]: np.allclose(A**3,np.einsum('ijk,ijk,ijk->ijk',A,A,A))
Out[117]: True
That's a good speedup there.
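Putting the pieces together, the full three-step computation with the einsum-based cube would look something like this (a sketch assembled from the steps above, not timed in the original post):

```python
import numpy as np

A = np.random.rand(1000, 1000, 3)
B = np.random.rand(1000, 1000, 3)

def func_orig(A, B):
    # Original three steps from the question
    A = A ** 3
    A = np.maximum(A, 0.001)
    return np.divide(B, A)

def func_einsum(A, B):
    # Replace the slow power computation with the einsum-based cube
    cubed = np.einsum('ijk,ijk,ijk->ijk', A, A, A)
    return np.divide(B, np.maximum(cubed, 0.001))

print(np.allclose(func_orig(A, B), func_einsum(A, B)))  # True
```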
Upvotes: 2
Reputation: 74262
Is there any way to do that, accessing each element in memory only once (thus making it time-efficient), without the need to write a custom ufunc?
Yes, this is exactly what numexpr
was designed for.
import numpy as np
import numexpr as ne
def func1(A, B):
    A = A ** 3
    A = np.maximum(A, 0.001)
    return np.divide(B, A)

def func2(A, B):
    return ne.evaluate("B / where(A**3 > 0.001, A**3, 0.001)",
                       local_dict={'A': A, 'B': B})
A, B = np.random.randn(2, 1000, 1000, 3)
print(np.allclose(func1(A, B), func2(A, B)))
# True
numexpr
gives about a factor of 70 improvement over your original code:
In [1]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func1(A, B)
....:
1 loop, best of 3: 837 ms per loop
In [2]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func2(A, B)
....:
The slowest run took 8.87 times longer than the fastest. This could mean that an
intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
In part this is because numexpr
will use multiple threads for the computation by default, but even with a single thread it still crushes naive vectorization:
In [3]: ne.set_num_threads(1)
Out[3]: 8
In [4]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func2(A, B)
....:
10 loops, best of 3: 47.3 ms per loop
Upvotes: 3