dsapprentice

Reputation: 102

Vectorisation of coordinate distances in Numpy

I'm trying to understand NumPy by applying vectorisation, and I'm looking for the fastest function to compute the pairwise distances between coordinates.

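import numpy as np

def get_distances3(coordinates):
    # Broadcast (N, 1, 3) against (1, N, 3) to get all pairwise differences,
    # then take the Euclidean norm over the last axis.
    return np.linalg.norm(
        coordinates[:, None, :] - coordinates[None, :, :],
        axis=-1)

coordinates = np.random.rand(1000, 3)
%timeit get_distances3(coordinates)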

The function above ran at 35.4 ms per loop (10 loops, best of 3). The NumPy library also has an np.vectorize option, so I tried that as well.

def get_distances4(coordinates):
  return np.vectorize(coordinates[:, None, :] - coordinates[None, :, :],axis=-1)

%timeit get_distances4(coordinates)

I tried np.vectorize as shown above, yet ended up with the following error.

TypeError: __init__() got an unexpected keyword argument 'axis'

How can I use np.vectorize in get_distances4? How should I edit the last code snippet in order to avoid the error? I have never used np.vectorize, so I might be missing something.

Upvotes: 2

Views: 490

Answers (1)

Iguananaut

Reputation: 23376

You're not calling np.vectorize() correctly. I suggest referring to the documentation.

np.vectorize takes as its argument a function that is written to operate on scalar values, and converts it into a function that can be applied across the values in arrays according to the NumPy broadcasting rules. It's basically like a fancy map() for NumPy arrays.

That is, as you know, NumPy already has built-in vectorized versions of many common functions, but if you had a custom function such as my_special_function(x) and wanted to call it on NumPy arrays, you could use my_special_function_ufunc = np.vectorize(my_special_function).
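As a rough illustration of that pattern, here is a minimal sketch. The body of my_special_function is hypothetical (the answer only names it); it is chosen so that it genuinely fails on arrays, because math.sqrt and the conditional both expect a single number.

```python
import math
import numpy as np

def my_special_function(x):
    # Hypothetical scalar-only function: math.sqrt and the conditional
    # both expect one number, not an array.
    return math.sqrt(x) if x >= 0 else 0.0

# Wrap it so it can be applied element by element to an array.
my_special_function_ufunc = np.vectorize(my_special_function)

values = np.array([4.0, -1.0, 9.0])
result = my_special_function_ufunc(values)
print(result)
```

Calling my_special_function(values) directly would raise an error here, while the wrapped version loops over the elements for you.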

In your example above, you might "vectorize" your distance function like this:

>>> norm = np.linalg.norm
>>> get_distance4 = np.vectorize(lambda a, b: norm(a - b))
>>> get_distance4(coordinates[:, None, :], coordinates[None, :, :])

However, you will find that this is incredibly slow:

>>> %timeit get_distance4(coordinates[:, None, :], coordinates[None, :, :])
1 loop, best of 3: 10.8 s per loop

This is because your first example, get_distances3, already uses NumPy's built-in fast implementations of these operations, whereas the np.vectorize version calls the Python lambda I defined once per element of the broadcast 1000x1000x3 array, some 3,000,000 times.

In fact according to the docs:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
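One caveat, as far as I can tell: without a signature, np.vectorize hands the wrapped function scalars, so a lambda like norm(a - b) yields an N x N x 3 array of absolute coordinate differences rather than an N x N distance matrix. The sketch below assumes the signature parameter of np.vectorize (available since NumPy 1.12) and uses a small N of 50, since the Python function still runs once per pair. The names elementwise, pairwise and expected are illustrative only.

```python
import numpy as np

coordinates = np.random.rand(50, 3)  # small N: the Python call runs N*N times

# Without a signature the lambda receives scalars, so the trailing axis
# of length 3 is kept instead of being reduced.
elementwise = np.vectorize(lambda a, b: np.linalg.norm(a - b))
elementwise_result = elementwise(coordinates[:, None, :],
                                 coordinates[None, :, :])
print(elementwise_result.shape)

# With a signature the lambda receives whole length-3 vectors and
# returns one scalar per pair.
pairwise = np.vectorize(lambda a, b: np.linalg.norm(a - b),
                        signature="(n),(n)->()")
distances = pairwise(coordinates[:, None, :], coordinates[None, :, :])

# Compare against the plain broadcasting version.
expected = np.linalg.norm(
    coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
print(distances.shape, np.allclose(distances, expected))
```

This gives the right shape, but it is still a Python-level loop, so it should not be expected to compete with the broadcasting version on speed.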

If you want a potentially faster function for computing distances between vectors, you could use scipy.spatial.distance.pdist:

>>> from scipy.spatial import distance
>>> %timeit get_distances3(coordinates)
10 loops, best of 3: 24.2 ms per loop
>>> %timeit distance.pdist(coordinates)
1000 loops, best of 3: 1.77 ms per loop

It's worth noting that this has a different return format. Rather than a 1000x1000 array, it uses a condensed format that excludes the i = j entries and the i > j entries. If you wish, you can then use scipy.spatial.distance.squareform to convert back to the square matrix format.
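A minimal sketch of that round trip, assuming SciPy is installed. The variable names condensed, square and broadcast are illustrative only.

```python
import numpy as np
from scipy.spatial import distance

coordinates = np.random.rand(1000, 3)

# Condensed form: one entry per unordered pair, N * (N - 1) / 2 values.
condensed = distance.pdist(coordinates)

# Expand back to the full symmetric N x N matrix with a zero diagonal.
square = distance.squareform(condensed)

# Check against the broadcasting approach from the question.
broadcast = np.linalg.norm(
    coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
print(condensed.shape, square.shape, np.allclose(square, broadcast))
```

If you only need each pair once, keeping the condensed array avoids storing both halves of the symmetric matrix.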

Upvotes: 2
