Eric

Reputation: 2705

Grouping a set of points by proximity

I have a few thousand points expressed as a 2D float array of latitudes and longitudes.

(42.385305, -87.963793)
(41.703427, -88.121665)
(41.889764, -87.978553)
(41.995931, -87.787501)
(42.25875, -87.948199)
              .
              .
              .

In this set, the minimum and maximum of the latitudes are 34.03176 and 42.470814, and those of the longitudes are -118.238819 and -87.598201.

I want to group these points into zones of 0.025° latitude by 0.03° longitude, and then visit each zone once to do some calculations and operations on the points in it.

Or, it would be much better if I could find the areas where two or more points are located too close together, say within a 3 km radius.

I've thought of using a hash map or a 2D array, but setting effective keys or referencing the right zone is going to be tricky.
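For concreteness, something like this sketch is the hash-map idea I had in mind, keying each point by its truncated zone indices (grouping with a `defaultdict` is just one way to do it):

```python
from collections import defaultdict
from math import floor

# A few of the sample points, as (latitude, longitude) pairs.
points = [
    (42.385305, -87.963793),
    (41.703427, -88.121665),
    (41.889764, -87.978553),
    (41.995931, -87.787501),
    (42.25875, -87.948199),
]

# Key each point by its zone: 0.025 deg of latitude x 0.03 deg of longitude.
zones = defaultdict(list)
for lat, lon in points:
    key = (floor(lat / 0.025), floor(lon / 0.03))
    zones[key].append((lat, lon))

# Traverse each zone exactly once.
for key, members in zones.items():
    pass  # do calculations and operations on the points in this zone
```

The tricky part this sidesteps is key design: integer `floor` indices make neighboring zones easy to reference too, e.g. `(key[0] + 1, key[1])` is the zone one step north.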

An R-tree may not be appropriate: building one is complicated and not efficient here, especially since I do not need random access. I am traversing each zone one by one, as mentioned above.

What would be an efficient way to do this?

Upvotes: 0

Views: 1457

Answers (1)

CT Zhu

Reputation: 54380

If you fully vectorize the distance calculation, a few thousand points shouldn't take that long:

In [1]:
from numpy import *
In [3]:
def lg_lat_distance(p1, p2): # based on the Spherical Law of Cosines
    la1 = p1[0] # data format: (latitude, longitude), already in radians
    lg1 = p1[1]
    la2 = p2[0]
    lg2 = p2[1]
    # cos(d) = sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1 - lon2)
    return arccos(sin(la1)*sin(la2) + cos(la1)*cos(la2)*cos(lg1 - lg2))*6371 # in km
In [14]:
data = array([(42.385305, -87.963793),
              (41.703427, -88.121665),
              (41.889764, -87.978553),
              (41.995931, -87.787501),
              (42.25875, -87.948199)]) # 5 points
data = data/180*pi # degrees to radians
In [16]:
dist_matrix = (lg_lat_distance(hstack([data,]*5).reshape(-1, 2).T, vstack([data,]*5).T)).reshape(5, 5)
print(dist_matrix) # 5x5 symmetric matrix of pairwise distances in km, ~0 on the diagonal

In [17]:
%timeit dist_matrix=(lg_lat_distance(hstack([data,]*5).reshape(-1,2).T, vstack([data,]*5).T)).reshape(5,5)
1000 loops, best of 3: 245 µs per loop

I think things will become easy once you have that dist_matrix. You can filter out the pairs with pairwise distance < 5 km using boolean indexing, or you can run a cluster analysis.
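The boolean-indexing step might look like the sketch below. The 5 km threshold is the one mentioned above; rebuilding dist_matrix with NumPy broadcasting instead of the hstack/vstack trick is my own variation, not the only way:

```python
import numpy as np

# The 5 sample points as (latitude, longitude), converted to radians.
data = np.radians(np.array([
    (42.385305, -87.963793),
    (41.703427, -88.121665),
    (41.889764, -87.978553),
    (41.995931, -87.787501),
    (42.25875, -87.948199),
]))
la, lg = data[:, 0], data[:, 1]

# Spherical Law of Cosines, broadcast over all pairs at once.
cos_d = (np.sin(la[:, None]) * np.sin(la)
         + np.cos(la[:, None]) * np.cos(la) * np.cos(lg[:, None] - lg))
dist_matrix = np.arccos(np.clip(cos_d, -1.0, 1.0)) * 6371  # in km

# Boolean indexing: index pairs (i, j), i < j, closer than 5 km.
i, j = np.where(np.triu(dist_matrix < 5.0, k=1))
close_pairs = list(zip(i.tolist(), j.tolist()))
print(close_pairs)
```

With real data where pairs do fall under the threshold, the same mask (or the condensed distances) can feed a cluster analysis, e.g. `scipy.cluster.hierarchy.fcluster` with a distance criterion. The `np.clip` guards `arccos` against floating-point values slightly above 1 on the diagonal.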

Upvotes: 2
