Reputation: 1068
I have code that runs operations on numpy arrays. While linear algebra operations seem fast, I am now finding a bottleneck in a different place: the summation of two distinct arrays.
In the example below, wE3 and T1 are two 1000×1000×1000 arrays. First I calculate wE3 using a numpy operation, then I sum the two arrays.
import numpy as np
import scipy as sp
import time
N = 100
n = 1000
X = np.random.uniform(size = (N,n))
wE = np.mean(X,0)
wE3 = np.einsum('i,j,k->ijk', wE, wE, wE) #22 secs
T1 = np.random.uniform(size = (n,n,n))
a = wE3 + T1 #115 secs
The calculation of wE3 takes about 22 seconds, while the addition of wE3 and T1 takes 115 seconds.
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3? They should have roughly the same complexity.
Is there a way to speed up that code?
Upvotes: 0
Views: 366
Reputation: 151
You should really use Numba's JIT (just-in-time compiler) for this. Your code is a pure numpy pipeline, which is perfect for Numba.
All you have to do is move the code above into a function and put an @jit decorator on top of it. That typically gets speedups close to Cython.
However, as others have pointed out, it appears you are trying to work with data too large for your machine's memory, and Numba will not solve that problem.
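A minimal sketch of that suggestion (the function name and the small size here are illustrative, and the try/except fallback is only so the snippet runs even where numba is not installed):

```python
import numpy as np

try:
    from numba import jit  # requires numba to be installed
except ImportError:
    # no-op fallback decorator so the sketch still runs without numba
    def jit(*args, **kwargs):
        def wrap(f):
            return f
        return wrap

@jit(nopython=True)
def build_and_add(wE, T1):
    # Outer product of wE with itself, plus T1, fused into one loop nest
    # so no intermediate wE3 array is materialized separately.
    n = wE.shape[0]
    out = np.empty((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i, j, k] = wE[i] * wE[j] * wE[k] + T1[i, j, k]
    return out

n = 50  # small size for demonstration; the question uses n = 1000
wE = np.random.uniform(size=n)
T1 = np.random.uniform(size=(n, n, n))
a = build_and_add(wE, T1)
```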
Upvotes: 0
Reputation: 221744
That np.einsum('i,j,k->ijk', wE, wE, wE)
part isn't doing any sum-reduction and is essentially just a broadcasted elementwise multiplication (a three-way outer product). So, we can replace it with something like this -
wE[:,None,None] * wE[:,None] * wE
Runtime test -
In [9]: # Setup inputs at 1/5th of original dataset sizes
...: N = 20
...: n = 200
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [10]: %timeit np.einsum('i,j,k->ijk', wE, wE, wE)
10 loops, best of 3: 45.7 ms per loop
In [11]: %timeit wE[:,None,None] * wE[:,None] * wE
10 loops, best of 3: 26.1 ms per loop
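To confirm the two expressions compute the same array, here is a quick check (the size is illustrative):

```python
import numpy as np

n = 50
wE = np.random.uniform(size=n)

# einsum with no repeated index in the output performs no reduction:
# it is the same broadcasted product as the explicit version below.
a = np.einsum('i,j,k->ijk', wE, wE, wE)
# (n,1,1) * (n,1) * (n,) broadcasts to (n,n,n), with b[i,j,k] = wE[i]*wE[j]*wE[k]
b = wE[:, None, None] * wE[:, None] * wE

print(np.allclose(a, b))  # True
```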
Next up, we have wE3 + T1, where T1 = np.random.uniform(size = (n,n,n)). This doesn't look like it can be helped in a big way, as we have to create T1
anyway and then it's just elementwise addition. It seems we can use np.add,
which lets us write the result back into one of the arrays: wE3
or T1.
Let's say we choose T1,
if it's okay for that one to be modified. This should bring a slight memory saving, as we avoid allocating a third array for the result.
Thus, we could do -
np.add(wE3,T1,out=T1)
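A tiny illustration of the in-place form (the values and sizes are illustrative):

```python
import numpy as np

n = 4
wE3 = np.ones((n, n, n))
T1 = np.full((n, n, n), 2.0)

# Write the sum directly into T1's buffer instead of allocating a new array.
result = np.add(wE3, T1, out=T1)

print(result is T1)   # True: np.add returned the same array, no third buffer
print(T1[0, 0, 0])    # 3.0
```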
Runtime test -
In [58]: def func1(wE3):
...: T1 = np.random.uniform(size = (n,n,n))
...: return wE3 + T1
...:
...: def func2(wE3):
...: T1 = np.random.uniform(size = (n,n,n))
...: np.add(wE3,T1,out=T1)
...: return T1
...:
In [59]: # Setup inputs at 1/4th of original dataset sizes
...: N = 25
...: n = 250
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:
In [60]: %timeit func1(wE3)
1 loops, best of 3: 390 ms per loop
In [61]: %timeit func2(wE3)
1 loops, best of 3: 363 ms per loop
Using @Aaron's suggestion, we can use a loop and, assuming that writing the results back into wE3
is okay, do -
wE3 = wE[:,None,None] * wE[:,None] * wE
for x in wE3:
np.add(x, np.random.uniform(size = (n,n)), out=x)
Final results
Putting together all the suggested improvements, the final runtime test results were -
In [97]: def func1(wE):
...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...: T1 = np.random.uniform(size = (n,n,n))
...: return wE3 + T1
...:
...: def func2(wE):
...: wE3 = wE[:,None,None] * wE[:,None] * wE
...: for x in wE3:
...: np.add(x, np.random.uniform(size = (n,n)), out=x)
...: return wE3
...:
In [98]: # Setup inputs at 1/3rd of original dataset sizes
...: N = 33
...: n = 330
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [99]: %timeit func1(wE)
1 loops, best of 3: 1.09 s per loop
In [100]: %timeit func2(wE)
1 loops, best of 3: 879 ms per loop
Upvotes: 1
Reputation: 114976
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3?
The arrays wE3, T1 and a
each require 8 gigabytes of memory (1000³ elements × 8 bytes per float64). You are probably running out of physical memory, and swap memory access is killing your performance.
Is there a way to speed up that code?
Get more physical memory (i.e. RAM).
If that is not possible, look at what you are going to do with these arrays, and see if you can work in batches such that the total memory required while processing a batch stays within the limits of your physical memory.
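A sketch of that batching idea, processing the 3D array one slab along the first axis at a time so that only a (batch, n, n) chunk is ever live (the batch size and the per-slab processing step, here a running sum, are illustrative assumptions):

```python
import numpy as np

n = 200        # illustrative; the question uses n = 1000
batch = 50     # rows of the first axis processed per step

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, n))
wE = X.mean(axis=0)

total = 0.0
for start in range(0, n, batch):
    stop = min(start + batch, n)
    # Build only a (stop-start, n, n) slab of wE3 + T1 at a time.
    slab = wE[start:stop, None, None] * wE[:, None] * wE
    slab += rng.uniform(size=(stop - start, n, n))  # this slab's share of T1
    # Consume the slab immediately (here: accumulate its sum) so it can be freed.
    total += slab.sum()

print(total)
```

Each slab here costs batch × n × n × 8 bytes instead of n³ × 8, so the peak footprint scales with the batch size you can afford.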
Upvotes: 1