Ulderique Demoitre

Reputation: 1068

Speed up numpy summation between big arrays

I have code running operations on numpy arrays. While the linear algebra operations seem fast, I am now finding a bottleneck elsewhere: the summation of two large arrays. In the example below, wE3 and T1 are two 1000×1000×1000 arrays. First I calculate wE3 using a numpy operation, then I sum the two arrays.

import numpy as np
import scipy as sp
import time
N = 100
n = 1000
X = np.random.uniform(size = (N,n))
wE = np.mean(X,0)
wE3 = np.einsum('i,j,k->ijk', wE, wE, wE) #22 secs
T1 = np.random.uniform(size = (n,n,n))
a = wE3 + T1 #115 secs

The calculation of wE3 takes about 22 seconds, while the addition of wE3 and T1 takes 115 seconds.

Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3? They should have more or less the same complexity.

Is there a way to speed up that code?

Upvotes: 0

Views: 366

Answers (3)

Mansweet

Reputation: 151

You should really use Numba's JIT (just-in-time compiler) for this. Your code is a pure numpy pipeline, which is perfect for Numba.

All you have to do is put the code above into a function and add an @jit decorator on top. It gets speedups close to Cython's.

However, as others have pointed out, it appears you're trying to work with data too large for your local machine, and Numba would not solve that problem.
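For illustration, here is a minimal sketch of that idea (the function name and the explicit-loop formulation are mine, not from the question; np.einsum is not supported in Numba's nopython mode, so the triple outer product is spelled out as loops, which Numba compiles well):

import numpy as np
from numba import njit

@njit
def outer3_plus(wE, T1):
    # out[i,j,k] = wE[i]*wE[j]*wE[k] + T1[i,j,k], fused in a single pass
    n = wE.shape[0]
    out = np.empty((n, n, n))
    for i in range(n):
        for j in range(n):
            wij = wE[i] * wE[j]
            for k in range(n):
                out[i, j, k] = wij * wE[k] + T1[i, j, k]
    return out

Fusing the product and the addition this way also avoids materializing wE3 as a separate 8 GB intermediate, though, as noted, it does nothing about the memory needed for the input and output arrays themselves.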

Upvotes: 0

Divakar

Reputation: 221744

That np.einsum('i,j,k->ijk', wE, wE, wE) part isn't doing any sum-reduction and is essentially just broadcasted elementwise multiplication. So, we can replace that with something like this -

wE[:,None,None] * wE[:,None] * wE
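As a quick sanity check (small sizes, my own example), the two expressions produce the same array:

import numpy as np

n = 50
wE = np.random.uniform(size = n)
a = np.einsum('i,j,k->ijk', wE, wE, wE)
b = wE[:,None,None] * wE[:,None] * wE
print(np.allclose(a, b))  # True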

Runtime test -

In [9]: # Setup inputs at 1/5th of original dataset sizes
   ...: N = 20
   ...: n = 200
   ...: X = np.random.uniform(size = (N,n))
   ...: wE = np.mean(X,0)
   ...: 

In [10]: %timeit np.einsum('i,j,k->ijk', wE, wE, wE)
10 loops, best of 3: 45.7 ms per loop

In [11]: %timeit wE[:,None,None] * wE[:,None] * wE
10 loops, best of 3: 26.1 ms per loop

Next up, we have wE3 + T1, where T1 = np.random.uniform(size = (n,n,n)). This doesn't look like it can be helped in a big way, as we have to create T1 anyway, and then it's just element-wise addition. It seems we can use np.add, which lets us write the results back into one of the arrays: wE3 or T1. Let's say we choose T1, if that's okay to be modified. This should bring a slight memory saving, as we won't be adding another large array into the workspace.

Thus, we could do -

np.add(wE3,T1,out=T1)

Runtime test -

In [58]: def func1(wE3):
    ...:     T1 = np.random.uniform(size = (n,n,n))
    ...:     return wE3 + T1
    ...: 
    ...: def func2(wE3):
    ...:     T1 = np.random.uniform(size = (n,n,n))
    ...:     np.add(wE3,T1,out=T1)
    ...:     return T1
    ...: 

In [59]: # Setup inputs at 1/4th of original dataset sizes
    ...: N = 25
    ...: n = 250
    ...: X = np.random.uniform(size = (N,n))
    ...: wE = np.mean(X,0)
    ...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
    ...: 

In [60]: %timeit func1(wE3)
1 loops, best of 3: 390 ms per loop

In [61]: %timeit func2(wE3)
1 loops, best of 3: 363 ms per loop

Using @Aaron's suggestion, we can use a loop, so that the random block is generated one (n,n) slice at a time and the peak memory use stays lower. Assuming that writing the results back into wE3 is okay, we could do -

wE3 = wE[:,None,None] * wE[:,None] * wE
for x in wE3: 
    np.add(x, np.random.uniform(size = (n,n)), out=x)

Final results

Thus, putting all the suggested improvements together, the final runtime test results were -

In [97]: def func1(wE):
    ...:     wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
    ...:     T1 = np.random.uniform(size = (n,n,n))
    ...:     return wE3 + T1
    ...: 
    ...: def func2(wE):
    ...:     wE3 = wE[:,None,None] * wE[:,None] * wE
    ...:     for x in wE3: 
    ...:         np.add(x, np.random.uniform(size = (n,n)), out=x)
    ...:     return wE3
    ...: 

In [98]: # Setup inputs at 1/3rd of original dataset sizes
    ...: N = 33
    ...: n = 330
    ...: X = np.random.uniform(size = (N,n))
    ...: wE = np.mean(X,0)
    ...: 

In [99]: %timeit func1(wE)
1 loops, best of 3: 1.09 s per loop

In [100]: %timeit func2(wE)
1 loops, best of 3: 879 ms per loop

Upvotes: 1

Warren Weckesser

Reputation: 114976

Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3?

The arrays wE3, T1 and a each require 8 gigabytes of memory: 1000×1000×1000 float64 values at 8 bytes each is 8×10⁹ bytes. You are probably running out of physical memory, and swap memory access is killing your performance.

Is there a way to speed up that code?

Get more physical memory (i.e. RAM).

If that is not possible, take a look at what you are going to do with these arrays, and see if you can work in batches such that the total memory required when processing a batch remains within the limits of your physical memory.
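As a rough sketch of that batching idea (the chunk size, the output file name, and the use of a memory-mapped output array are my own assumptions, not part of this answer), you could build and store the sum one slab of the first axis at a time:

import numpy as np

n = 1000
chunk = 50  # hypothetical slab size; pick one whose working set fits in RAM

wE = np.random.uniform(size = n)

# Memory-mapped output: the full (n,n,n) result lives on disk, and only
# one (chunk,n,n) slab is held in RAM at any moment.
out = np.lib.format.open_memmap('a.npy', mode='w+',
                                dtype=np.float64, shape=(n, n, n))

for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    slab = wE[start:stop,None,None] * wE[:,None] * wE        # slab of wE3
    slab += np.random.uniform(size = (stop - start, n, n))   # slab of T1
    out[start:stop] = slab

Whether this helps depends on what you do with a afterwards; if the downstream processing can also work slab by slab, the full array never needs to be in memory at once.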

Upvotes: 1
