Görg

Reputation: 141

How to multiply two arrays at specific indexes?

Toy example

I have two arrays of different shapes, for example:

import numpy as np
matrix = np.arange(5*6*7*8).reshape(5, 6, 7, 8)
vector = np.arange(1, 20, 2)

What I want to do is multiply each element of matrix by one of the elements of vector and then sum over the last two axes; in other words, res[i, j] = sum over k, l of matrix[i, j, k, l] * vector[Idx[i, j, k, l]]. For that, I have an array Idx with the same shape as matrix that tells me which index to use, for example:

Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size

I know that one of the solution would be to do:

matVec = vector[Idx]
res = np.sum(matrix*matVec, axis=(2, 3))

or even:

res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)

Problem

However, my problem is that my arrays are big, and constructing matVec costs both time and memory. So is there a way to bypass that step and still get the same result?
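
One partial workaround would be to group each (i, j) pane's entries by their index with np.bincount and then take a single small dot product with vector, so that only a vector.size-long temporary is created per pane. Here is a minimal sketch (sum_by_index is just an illustrative name; np.bincount only accepts real weights, so this does not cover complex data directly):

import numpy as np

def sum_by_index(matrix, Idx, vector):
    # same result as np.sum(matrix * vector[Idx], axis=(2, 3)),
    # without materializing the full matVec array
    I, J = matrix.shape[:2]
    res = np.empty((I, J))  # np.bincount accumulates in float64
    for i in range(I):
        for j in range(J):
            # histogram the pane's entries by their vector index...
            w = np.bincount(Idx[i, j].ravel(),
                            weights=matrix[i, j].ravel(),
                            minlength=vector.size)
            # ...then one small dot product finishes the reduction
            res[i, j] = w @ vector
    return res

On the toy example above, np.allclose(sum_by_index(matrix, Idx, vector), np.sum(matrix * vector[Idx], axis=(2, 3))) returns True.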

More realistic example

Here is a more realistic example of what I'm actually doing:

import numpy as np

order = 20
dim = 23

listOrder = np.arange(-order, order+1, 1)
N, P = np.meshgrid(listOrder, listOrder)
K = np.arange(-2*dim+1, 2*dim+1, 1)
X = np.arange(-2*dim, 2*dim, 1)

tN = np.einsum('..., p, x -> ...px', N, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))
tP = np.einsum('..., p, x -> ...px', P, np.ones(K.shape, dtype=int), np.ones(X.shape, dtype=int))
tK = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), K, np.ones(X.shape, dtype=int))
tX = np.einsum('..., p, x -> ...px', np.ones(P.shape, dtype=int), np.ones(K.shape, dtype=int), X)
tL = tK + tX

mini, maxi = -4*dim+1, 4*dim-1
NmPp2L = np.arange(2*mini-2*order, 2*maxi+2*order+1)
Idx = (2*tL+tN-tP) - NmPp2L[0]

np.random.seed(0)
matrix = (np.random.rand(Idx.size) + 1j*np.random.rand(Idx.size)).reshape(Idx.shape)
vector = np.random.rand(np.max(Idx)+1) + 1j*np.random.rand(np.max(Idx)+1)

res = np.sum(matrix*vector[Idx], axis=(2, 3))

Upvotes: 4

Views: 451

Answers (1)

Michael Szczesny

Reputation: 5036

For larger data arrays

import numpy as np

matrix = np.arange(50*60*70*80).reshape(50, 60, 70, 80)
vector = np.arange(1, 50, 2)
Idx = (matrix+np.random.randint(0, vector.size, size=matrix.shape))%vector.size

parallel numba speeds up the computation and avoids creating matVec.

On a 4-core Intel Xeon Platinum 8259CL

matVec = vector[Idx]
# %timeit 48.4 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

res = np.einsum('ijkl, ijkl -> ij', matrix, matVec)
# %timeit 26.9 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

against an unoptimized numba implementation

import numba as nb

@nb.njit(parallel=True)
def func(matrix, idx, vector):
    # one output cell per (i, j) pane; accumulate in the input dtype
    res = np.zeros((matrix.shape[0], matrix.shape[1]), dtype=matrix.dtype)
    for i in nb.prange(matrix.shape[0]):  # parallelize over the first axis
        for j in range(matrix.shape[1]):
            for k in range(matrix.shape[2]):
                for l in range(matrix.shape[3]):
                    # gather from vector on the fly instead of building matVec
                    res[i,j] += matrix[i,j,k,l] * vector[idx[i,j,k,l]]
    return res

func(matrix, Idx, vector)  # warm-up call to trigger JIT compilation
func(matrix, Idx, vector)
# %timeit 21.7 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# (48.4 + 26.9) / 21.7 = ~3.47x speed up

np.testing.assert_equal(func(matrix, Idx, vector), np.einsum('ijkl, ijkl -> ij', matrix, matVec))

Performance and further optimizations

The above Numba code should be memory-bound when dealing with complex numbers. Indeed, matrix and Idx are big and must be read entirely from the relatively slow RAM. matrix has a size of 41*41*92*92*8*2 = 217.10 MiB, and Idx a size of either 41*41*92*92*8 = 108.55 MiB or 41*41*92*92*4 = 54.28 MiB depending on the target platform (Idx should be of type int32 on most x86-64 Windows platforms and int64 on most x86-64 Linux platforms). This is also why vector[Idx] was slow: NumPy needs to write a big array to memory (not to mention that writing data is typically about twice as slow as reading it on x86-64 platforms in this case).
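
Those figures check out with a quick back-of-the-envelope computation (41 = 2*order + 1 and 92 = 4*dim, matching the shapes of the question's realistic example):

n = 41 * 41 * 92 * 92   # elements in matrix and in Idx
print(n * 16 / 2**20)   # complex128 matrix: ~217.10 MiB
print(n * 8 / 2**20)    # int64 Idx: ~108.55 MiB
print(n * 4 / 2**20)    # int32 Idx: ~54.28 MiB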

Assuming the code is memory-bound, the only way to speed it up is to reduce the amount of data read from RAM. This can be achieved by storing Idx in a uint16 array instead of the default np.int_ datatype (which is 2 to 4 times bigger). This is possible since vector.size is small. On a Linux machine with an i5-9600KF processor and 38-40 GiB/s RAM, this optimization resulted in a ~29% speed-up, while the code remains mainly memory-bound. The implementation is then nearly optimal on this platform.
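
A minimal sketch of that change, reusing func from above (the assert just confirms the downcast is lossless because every index fits in 16 bits; the first call with the new dtype triggers a fresh JIT compilation):

import numpy as np

assert Idx.max() <= np.iinfo(np.uint16).max  # all indices fit in 16 bits
Idx16 = Idx.astype(np.uint16)                # 2-4x fewer bytes than np.int_

res = func(matrix, Idx16, vector)  # same numba kernel, new specialization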

Upvotes: 5
