Reputation: 141
Toy example
I have two arrays with different shapes, for example:
import numpy as np
matrix = np.arange(5*6*7*8).reshape(5, 6, 7, 8)
vector = np.arange(1, 20, 2)
What I want to do is multiply each element of matrix by one of the elements of vector and then sum over the last two axes. For that, I have an array with the same shape as matrix that tells me which index to use, for example:
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
I know that one solution would be to do:
matVec = vector[Idx]
res = np.sum(matrix*matVec, axis=(2, 3))
or even:
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
Problem
However, my problem is that my arrays are big, and the construction of matVec costs both time and memory. Is there a way to bypass it and still achieve the same result?
More realistic example
Here is a more realistic example of what I'm actually doing:
import numpy as np
order = 20
dim = 23
listOrder = np.arange(-order, order+1, 1)
N, P = np.meshgrid(listOrder, listOrder)
K = np.arange(-2*dim+1, 2*dim+1, 1)
X = np.arange(-2*dim, 2*dim, 1)
tN = np.einsum('..., p, x -> ...px', N, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tP = np.einsum('..., p, x -> ...px', P, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))#, optimize=pathInt)
tK = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), K, np.ones(X.shape, dtype=int))#, optimize=pathInt)
tX = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), np.ones(K.shape, dtype=int), X)#, optimize=pathInt)
tL = tK + tX
mini, maxi = -4*dim+1, 4*dim-1
NmPp2L = np.arange(2*mini-2*order, 2*maxi+2*order+1)
Idx = (2*tL+tN-tP) - NmPp2L[0]
np.random.seed(0)
matrix = (np.random.rand(Idx.size) + 1j*np.random.rand(Idx.size)).reshape(Idx.shape)
vector = np.random.rand(np.max(Idx)+1) + 1j*np.random.rand(np.max(Idx)+1)
res = np.sum(matrix*vector[Idx], axis=(2, 3))
Upvotes: 4
Views: 451
Reputation: 5036
For larger data arrays
import numpy as np
matrix = np.arange(50*60*70*80).reshape(50, 60, 70, 80)
vector = np.arange(1, 50, 2)
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size
A parallel numba implementation speeds up the computation and avoids creating matVec. On a 4-core Intel Xeon Platinum 8259CL, the NumPy version takes:
matVec = vector[Idx]
# %timeit 48.4 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
# %timeit 26.9 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
against an unoptimized numba implementation:
import numba as nb
@nb.njit(parallel=True)
def func(matrix, idx, vector):
    # Accumulate matrix[i,j,k,l] * vector[idx[i,j,k,l]] over the last two axes,
    # parallelized over the first axis; no temporary matVec array is created.
    res = np.zeros((matrix.shape[0], matrix.shape[1]), dtype=matrix.dtype)
    for i in nb.prange(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            for k in range(matrix.shape[2]):
                for l in range(matrix.shape[3]):
                    res[i, j] += matrix[i, j, k, l] * vector[idx[i, j, k, l]]
    return res
func(matrix, Idx, vector) # compile run
func(matrix, Idx, vector)
# %timeit 21.7 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# (48.4 + 26.9) / 21.7 = ~3.47x speed up
np.testing.assert_equal(func(matrix, Idx, vector), np.einsum('ijkl, ijkl -> ij', matrix, matVec))
The above Numba code should be memory-bound when dealing with complex numbers. Indeed, matrix and Idx are big and must be completely read from the relatively slow RAM. matrix has a size of 41*41*92*92*8*2 = 217.10 MiB and Idx a size of either 41*41*92*92*8 = 108.55 MiB or 41*41*92*92*4 = 54.28 MiB depending on the target platform (it should be of type int32 on most x86-64 Windows platforms and int64 on most x86-64 Linux platforms). This is also why vector[Idx] was slow: NumPy needs to write a big array to memory (not to mention that writing data is about twice as slow as reading it on x86-64 platforms in this case).
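As a quick sanity check on those figures, the sizes can be read directly from the arrays of the realistic example above (a minimal sketch; the exact size of Idx depends on the platform's default integer type):

# Memory traffic of the main loop: both arrays are read entirely from RAM.
print(matrix.nbytes / 2**20)  # ~217.1 MiB (complex128, 16 bytes per element)
print(Idx.nbytes / 2**20)     # ~108.6 MiB with int64 indices, ~54.3 MiB with int32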
Assuming the code is memory-bound, the only way to speed it up is to reduce the amount of data read from RAM. This can be achieved by storing Idx in a uint16 array instead of the default np.int_ datatype (2~4 times bigger). This is possible since vector.size is small. On Linux with an i5-9600KF processor and 38-40 GiB/s RAM, this optimization resulted in a ~29% speed up, while the code is still mainly memory-bound. The implementation is nearly optimal on this platform.
Upvotes: 5