jjg

Reputation: 1024

Numpy, replace a broadcast by iteration

I have the following code snippet:

import numpy as np

def norm(x1, x2):
    return np.sqrt(((x1 - x2)**2).sum(axis=0))

def call_norm(x1, x2):
    x1 = x1[..., :, np.newaxis]
    x2 = x2[..., np.newaxis, :]
    return norm(x1, x2)

As I understand it, each x represents an array of points in N-dimensional space, where N is the size of the first dimension of the array (so for the points in 2-space below, the first dimension is size 2 and each point is a column). It inserts extra dimensions and uses broadcasting to generate the cartesian product of these sets of points, and so calculates the distance between all pairs of points.

x = np.array([[1, 2, 3],[1, 2, 3]])
call_norm(x, x)
array([[ 0.        ,  1.41421356,  2.82842712],
       [ 1.41421356,  0.        ,  1.41421356],
       [ 2.82842712,  1.41421356,  0.        ]])

(so the distance between [1,1] and [2,2] is 1.41421356, as expected)
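To spell out the broadcasting, the intermediate shapes for this example are:

x = np.array([[1, 2, 3], [1, 2, 3]])  # shape (2, 3): 3 points in 2-space
x1 = x[..., :, np.newaxis]            # shape (2, 3, 1)
x2 = x[..., np.newaxis, :]            # shape (2, 1, 3)
# x1 - x2 broadcasts to (2, 3, 3); summing the squares over axis 0
# (the coordinate axis) leaves the (3, 3) matrix of squared distances.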

I find that for moderately sized problems this approach can use huge amounts of memory. I can easily "de-vectorise" the problem and replace it by iteration, but I'd expect that to be slow. Is there a (reasonably) easy compromise solution where I could have most of the speed advantages of vectorisation but without the memory penalty? Some fancy generator trick?

Upvotes: 1

Views: 218

Answers (1)

jakevdp

Reputation: 86433

There is no way to do this kind of computation with numpy vectorization without paying the memory penalty. For the specific case of efficiently computing pairwise distance matrices, packages tend to get around this by implementing things in C (e.g. scipy.spatial.distance) or in Cython (e.g. sklearn.metrics.pairwise).
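For the question's example, the scipy route is essentially a one-liner. One caveat: cdist expects one point per row, while the arrays in the question store one point per column, hence the transposes below:

import numpy as np
from scipy.spatial.distance import cdist

x = np.array([[1, 2, 3], [1, 2, 3]])

# cdist wants (n_points, n_dims); the question uses (n_dims, n_points),
# so transpose before calling it.
result = cdist(x.T, x.T)   # same (3, 3) matrix as call_norm(x, x)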

If you want to do this "by hand", so to speak, using numpy-style syntax but without incurring the memory penalty, the current best option is probably dask.array, which automates the construction and execution of flexible task graphs for batched (and potentially out-of-core) execution.

Here's an example of using dask for this computation:

import dask.array as da

# Create the chunked data. Existing numpy arrays can be
# chunked as well, e.g. x_dask = da.from_array(x_numpy, chunks=5)
x = da.random.random((100, 3), chunks=5)
y = da.random.random((200, 3), chunks=5)

# Build the task graph (syntax just like numpy!)
diffs = x[:, None, :] - y[None, :, :]
dist = da.sqrt((diffs ** 2).sum(-1))

# Execute the task graph
result = dist.compute()
print(result.shape)
# (100, 200)

You'll find that dask is much more memory-efficient than plain NumPy, is often more computationally efficient as well, and can also be computed in parallel/out-of-core relatively straightforwardly.
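If you'd rather stay in plain numpy, a middle ground is to hand-roll the chunking that dask automates: broadcast over one block of points at a time, so peak memory scales with the block size rather than the full pairwise product. A minimal sketch of that idea (the function name and default chunk size here are just illustrative):

import numpy as np

def pairwise_dist_chunked(x1, x2, chunk=256):
    # x1: (dims, n1), x2: (dims, n2); points stored as columns,
    # as in the question.
    n1, n2 = x1.shape[1], x2.shape[1]
    out = np.empty((n1, n2))
    for start in range(0, n1, chunk):
        stop = min(start + chunk, n1)
        # Broadcast a (dims, block, 1) slice against (dims, 1, n2):
        # the temporary holds dims * block * n2 elements instead of
        # dims * n1 * n2.
        diff = x1[:, start:stop, None] - x2[:, None, :]
        out[start:stop] = np.sqrt((diff ** 2).sum(axis=0))
    return out

# Same result as call_norm(x, x) from the question:
x = np.array([[1, 2, 3], [1, 2, 3]])
print(pairwise_dist_chunked(x, x))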

Upvotes: 1
