Stéphane
Stéphane

Reputation: 1344

smart computation of distances over a (sparse ?) numpy.meshgrid

I want to compute distances between every pairs of nodes on a regular 3D-grid. The grid may be huge, so I want to optimize the computation (privilege to the CPU).

After many many tests, I gave up the ‘scitools’ module, that is convenient only for python2, and use a combination of numpy.meshgrid() and scipy.spatial.distance.pdist(). For instance, for a 20x20x20 grid:

distances = scipy.spatial.distance.pdist(np.transpose(np.array(np.meshgrid(range(-10,10,1),range(-10,10,1),range(-10,10,1)))).reshape([20**3,3]))

Is it optimized ?

first 'Maybe this is the way to go...': I see there is a ‘sparse’ option in meshgrid, so it would please me to use

np.meshgrid(range(-10,10,1),range(-10,10,1),range(-10,10,1),sparse=True)

rather than

np.meshgrid(range(-10,10,1),range(-10,10,1),range(-10,10,1))

Indeed, I could even keep the sparse shape in memory since it would be convenient later in the code. But I don’t find a satisfying syntaxe to combine pdist() with a sparse meshgrid.

Upvotes: 0

Views: 800

Answers (2)

Kyle Pena
Kyle Pena

Reputation: 537

I'm not sure what using a sparse meshgrid is going to buy you. You may save on space in the meshgrid itself, but the same number of pairwise distances will still need to be computed, and the end result will still be a "condensed" matrix per the documentation on pdist.

If you are attempting to optimize the number of distance computations that will be performed, the fact that the grid is regularly spaced immediately suggests some optimizations.

Here is an algorithm:

  1. Loop through all pairs of coordinates

  2. Create a "distance cache" which uses the difference between the coordinate pairs as the key. For example, distance_cache[(3,4)] = 5. That way, if two coordinates which have the same relative distance from each other are found, the distance between the coordinates is simply looked up from the cache instead of recomputed.

  3. Bonus points: Only store keys whose x and y are relatively prime in the distance cache. Due to triangle similarity, the same cache entry can be re-used for multiples of the key. For example, distance([6,8], [0,0]) = 2 * d[(3,4)] = 10

Upvotes: 1

hpaulj
hpaulj

Reputation: 231615

In [494]: [x.shape for x in np.meshgrid(range(-10,10,1),range(-10,10,1),range(-1
     ...: 0,10,1),sparse=False)]
Out[494]: [(20, 20, 20), (20, 20, 20), (20, 20, 20)]
In [495]: [x.shape for x in np.meshgrid(range(-10,10,1),range(-10,10,1),range(-1
     ...: 0,10,1),sparse=True)]
Out[495]: [(1, 20, 1), (20, 1, 1), (1, 1, 20)]

The non-sparse meshgrid produces 3 3d arrays, which you join and reshape into your 3 column array. The sparse version also produces 3d arrays, but each has a different shape. With broadcasting they can be used in the same way. For example if summed or multiplied, the result in both cases is a (20,20,20) array. But sparse ones cannot be made into that (20*20*20,3) array without expansion.

These are not scipy sparse arrays. That's an entirely different concept.

Look at one of those (20,20,20) arrays. See all the repeated columns or rows? sparse just avoids repeating those. It just takes the 20 element range and reshapes it. It lets broadcasting do the implicit repetitions.

Upvotes: 0

Related Questions