I have some large numpy arrays of complex numbers I need to perform computations on.
import numpy as np
# Reduced sizes -- real ones are orders of magnitude larger
n, d, l = 50000, 3, 1000
# Two complex matrices and a column vector
x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
v = np.random.rand(l)[:, np.newaxis]
The function is basically x*v*s for each row of x (and s), and then that product is summed across the row. Because the arrays are different sizes, I can't figure out a way to vectorize the computation, and it's way too slow to use a for-loop.
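For reference, here is how the shapes broadcast for one row (a small sketch at reduced sizes; the numbers are illustrative only):

import numpy as np

n, d, l = 100, 3, 50                      # tiny sizes, just for illustration
x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
v = np.random.rand(l)[:, np.newaxis]      # shape (l, 1)

row = x[0, :] * v * s[0, :]               # (d,) * (l, 1) * (d,) broadcasts to (l, d)
print(row.shape)                          # (50, 3)
print(np.sum(row, axis=1).shape)          # (50,) -- one row of the desired (n, l) result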
My current implementation is this (~3.5 seconds):
h = []
for i in range(len(x)):
    h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
h = np.asarray(h)
I also tried using np.apply_along_axis() with an augmented matrix, but it's only slightly faster (~2.6 s) and not that readable.
def func(m, v):
    return np.sum(m[:d]*v*m[d:], axis=1)

h = np.apply_along_axis(func, 1, np.hstack([x, s]), v)
What's a much quicker way to compute this result? I can leverage other packages such as dask if that helps.
Upvotes: 0
Views: 96
Reputation: 231325
With broadcasting this should work:
np.sum((x*s)[...,None]*v[:,0], axis=1)
but with your sample dimensions I'm getting a memory error: the 'outer' broadcasted array of shape (n, d, l) is too large for my memory.
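To put a number on it (a rough back-of-the-envelope, assuming complex128 at 16 bytes per element):

n, d, l = 50000, 3, 1000                  # the question's sample sizes
print(n * d * l * 16 / 1e9)               # 2.4 -- GB for one (n, d, l) complex128 temporary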
I can reduce memory usage by iterating on the smaller d dimension:
res = np.zeros((n,l), dtype=x.dtype)
for i in range(d):
    res += (x[:,i]*s[:,i])[:,None]*v[:,0]
This tests the same as your h, but I wasn't able to complete the time tests at full size. Generally, iterating on the smaller dimension is faster. Below I repeat things with smaller dimensions.
This probably can also be expressed as an einsum problem, though it may not help with these dimensions.
In [1]: n, d, l = 5000, 3, 1000
...:
...: # Two complex matrices and a column vector
...: x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: v = np.random.rand(l)[:, np.newaxis]
In [2]: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...:
...: h = np.asarray(h)
In [3]: h.shape
Out[3]: (5000, 1000)
In [4]: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
In [5]: res.shape
Out[5]: (5000, 1000)
In [6]: np.allclose(res,h)
Out[6]: True
In [7]: %%timeit
...: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...: h = np.asarray(h)
...:
...:
490 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
354 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: np.sum((x*s)[...,None]*v[:,0], axis=1).shape
Out[9]: (5000, 1000)
In [10]: out = np.sum((x*s)[...,None]*v[:,0], axis=1)
In [11]: np.allclose(h,out)
Out[11]: True
In [12]: timeit out = np.sum((x*s)[...,None]*v[:,0], axis=1)
310 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some time savings, but not big.
And the einsum version:
In [13]: np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
Out[13]: (5000, 1000)
In [14]: np.allclose(np.einsum('ij,ij,k->ik',x,s,v[:,0]),h)
Out[14]: True
In [15]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
167 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good time savings. But I don't know how it will scale.
But the einsum made me realize that we can sum on the d dimension earlier, before multiplying by v, and gain a lot in time and memory usage:
In [16]: np.allclose(np.sum(x*s, axis=1)[:,None]*v[:,0],h)
Out[16]: True
In [17]: timeit np.sum(x*s, axis=1)[:,None]*v[:,0]
68.4 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@cs95 got there first!
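Written out as a plain function, the fast version is just a sum over d followed by an outer product with v (a sketch; fast_h is my own name for it):

def fast_h(x, s, v):
    # h[i, k] = sum_j x[i, j] * s[i, j] * v[k], without the big (n, d, l) temporary
    w = np.sum(x * s, axis=1)             # shape (n,)
    return w[:, None] * v[:, 0]           # outer product -> shape (n, l)

np.outer(np.sum(x*s, axis=1), v[:, 0]) expresses the same outer product.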
As per @PaulPanzer's comment, the optimize flag helps. It's probably making the same deduction, that we can sum on j early:
In [18]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0],optimize=True).shape
91.6 ms ± 991 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
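One way to check that deduction (a sketch using np.einsum_path, which reports the contraction order that optimize picks):

path, info = np.einsum_path('ij,ij,k->ik', x, s, v[:,0], optimize='greedy')
print(info)    # the report should show x and s contracted first, i.e. the sum over j happens early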
Upvotes: 2