falling_up

Reputation: 81

Efficiently computing distance

I am trying to find the number of places within a 30 mile radius for every place. For example, for Springfield, IL, how many towns are in a 30 mile radius?

For every place, I have the name, the latitude, and the longitude; n = 30,000 places.

This problem would be relatively simple if the dataset were smaller:

PROC SQL; 
    CREATE TABLE Distance_Table_1 AS 
        SELECT 
             MASTER.PlaceID AS PlaceID 
            ,Master.INTPTLAT AS LAT1
            ,Master.INTPTLONG AS LONG1
            ,Match.INTPTLAT AS LAT2
            ,Match.INTPTLONG AS LONG2
            ,GEODIST(Master.INTPTLAT, Master.INTPTLONG, Match.INTPTLAT, Match.INTPTLONG, 'M') AS DISTANCE
        FROM MASTER_TABLE_CLEANED_ AS MASTER
        CROSS JOIN MASTER_TABLE_CLEANED_ AS MATCH
        ; 
QUIT; 

I would then create a count of all places within 30 miles for every place.
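The counting step would look something like this (a sketch; since Distance_Table_1 keeps only the Master PlaceID, self-pairs would have to be dropped by their zero distance):

PROC SQL; 
    CREATE TABLE Near_Counts AS 
        SELECT 
             PlaceID 
            /* self-pairs come through the cross join with DISTANCE = 0, so
               require a strictly positive distance; note this also drops any
               distinct place sharing exact coordinates */
            ,sum(DISTANCE > 0 and DISTANCE <= 30) AS Nearby 
        FROM Distance_Table_1 
        GROUP BY PlaceID 
        ; 
QUIT; 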

The problem is that this produces a ridiculously large table that my SAS session cannot handle (30k x 30k = 900M rows).

How might I process this more efficiently?

Upvotes: 0

Views: 138

Answers (3)

Richard

Reputation: 27508

Reduce the pair selection space.

Use DomPazz's PlaceID filter to restrict the set of pairs to evaluate, plus a lattice approach that requires approximate 30-mile nearness in both the latitude and longitude values.

The pairwise selection criteria would be

where
  fromCity.placeId < toCity.placeId
  and abs(fromCity.lat - toCity.lat) < &precomputed_Max30mileLatRange
  and abs(fromCity.long - toCity.long) < &precomputed_Max30mileLongRange
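where the macro variables are precomputed once from the degree sizes discussed next, for example:

%let precomputed_Max30mileLatRange  = 0.5;  /* ~35 miles of latitude */
%let precomputed_Max30mileLongRange = 2.5;  /* enough longitude anywhere below ~80 deg latitude */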

Using the information at http://longitudestore.com/how-big-is-one-gps-degree.html:

  • The latitudinal scale is nominally linear: 1 degree of latitude is ~69 miles.
  • The longitudinal scale varies: the closer you are to the north or south pole, the more degrees of longitude it takes to span 30 miles. At 80 degrees north latitude, 1 degree of longitude is roughly 12 miles.

So, presuming your map data has no place with latitude beyond 80 degrees, the following selection criteria will greatly reduce the pairings for which GEODIST needs to be computed.

where
  fromCity.placeId < toCity.placeId
  and abs(fromCity.lat - toCity.lat) < 0.5 /* ~35 miles */
  and abs(fromCity.long - toCity.long) < 2.5 /* anywhere from ~36 miles (at 80 lat) to ~175 mi (at the equator) */

All of this presumes a spherical-Earth approximation.
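Putting the pieces together, a PROC SQL sketch (reusing the question's table and column names; the final step is one way to fold each surviving pair back onto both of its places):

PROC SQL;
    /* cheap bounding-box filter first, exact GEODIST only on the survivors */
    CREATE TABLE Near_Pairs AS
        SELECT
             fromCity.PlaceID AS FromID
            ,toCity.PlaceID AS ToID
            ,GEODIST(fromCity.INTPTLAT, fromCity.INTPTLONG,
                     toCity.INTPTLAT, toCity.INTPTLONG, 'M') AS DISTANCE
        FROM MASTER_TABLE_CLEANED_ AS fromCity
           , MASTER_TABLE_CLEANED_ AS toCity
        WHERE fromCity.PlaceID < toCity.PlaceID
          and abs(fromCity.INTPTLAT - toCity.INTPTLAT) < 0.5
          and abs(fromCity.INTPTLONG - toCity.INTPTLONG) < 2.5
          and calculated DISTANCE <= 30
        ;

    /* each qualifying pair appears exactly once, so count it for both places */
    CREATE TABLE Near_Counts AS
        SELECT PlaceID, count(*) AS Nearby
        FROM (SELECT FromID AS PlaceID FROM Near_Pairs
              UNION ALL
              SELECT ToID AS PlaceID FROM Near_Pairs)
        GROUP BY PlaceID
        ;
QUIT;

Places with no neighbor within 30 miles will not appear in Near_Counts; left join it back to the master table if every place needs a row.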

Upvotes: 1

Kevin

Reputation: 2243

When you're joining a table against itself, you want to try to accomplish a few things to help with performance:

  • Make the resulting dataset as small as possible
  • Make it as easy as possible to compare the two entries

See the problem? You're not reducing the dataset at all, and you're performing a complicated distance calculation 30k x 30k times. Instead of eliminating as many possible records as quickly as you can, you're going ahead and brute-forcing everything up front.

Right off the bat, one easy way of improving performance would be to do something like:

select *
from cities c1
join cities c2
  on c1.ID < c2.ID
 and c2.Lat between c1.Lat - 0.44 and c1.Lat + 0.44    /* 30 miles is ~0.44 degrees of latitude */
 and c2.Long between c1.Long - 0.67 and c1.Long + 0.67 /* ~30 miles of longitude across the northern US; widen this nearer the poles */

... this will get you a much smaller list of possible candidates. It's not the final answer: you'll still have cities that are 25 miles north and 25 miles west of another (about 35 miles away, beyond the 30-mile cutoff). But you've greatly reduced the total number of distance checks you need to do, as well as the dataset you're doing operations on.

After that, you should play around with indexes on the table. My guess is that you'd want an index that includes both the Lat and Long columns together, so that you only need a single index to perform the operation.
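In SAS, for instance, a composite index can be created through PROC SQL (cities, Lat, and Long are the placeholder names from the query above):

PROC SQL;
    /* one composite index covering both coordinate columns */
    CREATE INDEX LatLong ON cities (Lat, Long);
QUIT;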

This should hopefully get you where you need to go - my guess is that this is all the optimization you'll need. But if you need to make things even faster, you can subdivide the data. After all, nothing in the western portion of the country will be within 30 miles of the eastern portion. (You'll need to account for cities on the edge of your divides, though.)
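That subdivision could be sketched as a SAS data step with a half-degree grid (an assumed cell size: roughly 35 miles of latitude; longitude cells shrink toward the poles, so the neighborhood searched would have to widen at high latitudes):

data cities_binned;
    set cities;  /* hypothetical dataset with Lat and Long columns */
    /* assign each place to a coarse grid cell; candidate pairs then only
       need to be drawn from the same cell or the 8 adjacent ones, which
       also covers places sitting on a cell edge */
    cell_lat  = floor(Lat / 0.5);
    cell_long = floor(Long / 0.5);
run;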

Upvotes: 1

DomPazz

Reputation: 12465

Gord mentioned this in the comments: just add a filter to the query and you can cut out the double counting and the distance-to-self calculations.

PROC SQL; 
    CREATE TABLE Distance_Table_1 AS 
        SELECT 
             MASTER.PlaceID AS PlaceID 
            ,Master.INTPTLAT AS LAT1
            ,Master.INTPTLONG AS LONG1
            ,Match.INTPTLAT AS LAT2
            ,Match.INTPTLONG AS LONG2
            ,GEODIST(Master.INTPTLAT, Master.INTPTLONG, Match.INTPTLAT, Match.INTPTLONG, 'M') AS DISTANCE
        FROM MASTER_TABLE_CLEANED_ AS MASTER
        CROSS JOIN MASTER_TABLE_CLEANED_ AS MATCH
        where match.PlaceID < master.PlaceID
        ; 
QUIT; 

Adding the where clause:

where match.PlaceID < master.PlaceID

That will return 449,985,000 records, i.e. (n^2 - n)/2 with n = 30,000. Hopefully that is small enough to handle.

(this takes 1:05 to run on a test table with 30k records on my laptop)

Upvotes: 1
