Reputation: 81
I am trying to find the number of places within a 30 mile radius for every place. For example, for Springfield, IL, how many towns are in a 30 mile radius?
For every place I have the name, the latitude, and the longitude; n = 30k.
This problem would be relatively simple if the dataset were smaller:
PROC SQL;
    CREATE TABLE Distance_Table_1 AS
    SELECT
        Master.PlaceID AS PlaceID
        ,Master.INTPTLAT AS LAT1
        ,Master.INTPTLONG AS LONG1
        ,Match.INTPTLAT AS LAT2
        ,Match.INTPTLONG AS LONG2
        ,GEODIST(Master.INTPTLAT, Master.INTPTLONG, Match.INTPTLAT, Match.INTPTLONG, 'M') AS DISTANCE
    FROM MASTER_TABLE_CLEANED_ AS Master
    CROSS JOIN MASTER_TABLE_CLEANED_ AS Match
    ;
QUIT;
I would then create a count of all places within 30 miles for every place.
The problem is that this produces a ridiculously large table that my SAS cannot handle (900M rows).
How might I process this more efficiently?
Upvotes: 0
Views: 138
Reputation: 27508
Reduce the pair selection space.
Use DomPazz's PlaceID comparison to restrict the set of pairs to evaluate, plus a lattice approach that requires an approximate 30-mile nearness in both the latitude and longitude values.
The pairwise selection criteria would be
where
    fromCity.placeId < toCity.placeId
    and abs(fromCity.lat - toCity.lat) < &precomputed_Max30mileLatRange
    and abs(fromCity.long - toCity.long) < &precomputed_Max30mileLongRange
Using the information at http://longitudestore.com/how-big-is-one-gps-degree.html, and presuming your map data does not have a place with latitude in excess of 80, the following selection criteria will greatly reduce the pairings for which GEODIST needs to be computed:
where
    fromCity.placeId < toCity.placeId
    and abs(fromCity.lat - toCity.lat) < 0.5   /* ~35 miles */
    and abs(fromCity.long - toCity.long) < 2.5 /* anywhere from ~36 miles at latitude 80 to ~175 miles at the equator */
This all presumes a spherical Earth.
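Putting it together, a minimal PROC SQL sketch of this approach, reusing the question's table and column names (Near_Counts and Places_Within_30mi are names invented here). Note that a per-place count has to credit each pair in both directions, so a plain ne comparison replaces the < that merely halves the pair list:
PROC SQL;
    CREATE TABLE Near_Counts AS
    SELECT fromCity.PlaceID
          ,COUNT(*) AS Places_Within_30mi
    FROM MASTER_TABLE_CLEANED_ AS fromCity
        ,MASTER_TABLE_CLEANED_ AS toCity
    WHERE fromCity.PlaceID ne toCity.PlaceID
      and abs(fromCity.INTPTLAT - toCity.INTPTLAT) < 0.5    /* cheap lattice pre-filter */
      and abs(fromCity.INTPTLONG - toCity.INTPTLONG) < 2.5
      and GEODIST(fromCity.INTPTLAT, fromCity.INTPTLONG,
                  toCity.INTPTLAT, toCity.INTPTLONG, 'M') <= 30  /* exact distance check */
    GROUP BY fromCity.PlaceID
    ;
QUIT;
Places with no neighbor inside 30 miles will simply be absent from Near_Counts; a left join back to the master table fills in the zeros.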
Upvotes: 1
Reputation: 2243
When you're joining a table against itself, you want to try to accomplish a few things to help with performance: above all, eliminate as many candidate pairs as early and as cheaply as possible, and save the expensive calculations for the small set that survives.
See the problem? You're not reducing the dataset at all, and you're performing a complicated distance calculation 30k x 30k times. Instead of eliminating as many possible records as quickly as you can, you're going ahead and brute-forcing everything up front.
Right off the bat, one easy way of improving performance would be to do something like:
select *
from cities c1
join cities c2
    on c1.ID < c2.ID
    and c2.Lat between c1.Lat - &lat_delta and c1.Lat + &lat_delta     /* &lat_delta: precomputed degrees spanning ~30 miles north-south */
    and c2.Long between c1.Long - &long_delta and c1.Long + &long_delta /* &long_delta: precomputed degrees spanning ~30 miles east-west */
... this will get you a much smaller list of possible candidates. It's not the final answer - you'll still have cities that are 25 miles north and 25 miles west of another, which is about 35 miles apart and beyond the 30-mile limit. But you've greatly reduced the total number of distance checks you need to do, as well as the dataset you're doing operations on.
After that, you should play around with indexes on the table. My guess is that you'd want an index that includes both the Lat and Long columns together, so that you only need a single index to perform the operation.
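In SAS, for instance, a composite index on a hypothetical cities table could be created with:
PROC SQL;
    /* one composite index covering both columns used in the join condition */
    CREATE INDEX LatLong ON cities (Lat, Long);
QUIT;
Whether the query planner actually uses it for a given join can be checked with the _METHOD option on the PROC SQL statement.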
This should hopefully get you where you need to go - my guess is that this is all the optimization you'll need. But if you need to make things even faster, you can subdivide the data. After all, nothing in the western portion of the country will be within 30 miles of the eastern portion. (You'll need to account for cities on the edge of your divides, though.)
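A rough sketch of that subdivision, as a DATA step that splits a hypothetical cities table at an arbitrary longitude and pads each half so border cities land in both:
/* -100 is an arbitrary split longitude; &edge must span at least
   30 miles of longitude at the southernmost latitude in the data */
%let edge = 0.6;
data west east;
    set cities;
    if Long <= -100 + &edge then output west;
    if Long >= -100 - &edge then output east;
run;
Run the pairwise comparison within each half separately, then drop the duplicate pairs that the overlap zone produces in both halves.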
Upvotes: 1
Reputation: 12465
Gord mentioned this in the comments: just add a filter to the query and you can cut out the double counting and the calculation of distance to self.
PROC SQL;
    CREATE TABLE Distance_Table_1 AS
    SELECT
        Master.PlaceID AS PlaceID
        ,Match.PlaceID AS PlaceID2 /* keep the second ID so each pair can be credited to both places */
        ,Master.INTPTLAT AS LAT1
        ,Master.INTPTLONG AS LONG1
        ,Match.INTPTLAT AS LAT2
        ,Match.INTPTLONG AS LONG2
        ,GEODIST(Master.INTPTLAT, Master.INTPTLONG, Match.INTPTLAT, Match.INTPTLONG, 'M') AS DISTANCE
    FROM MASTER_TABLE_CLEANED_ AS Master
    CROSS JOIN MASTER_TABLE_CLEANED_ AS Match
    WHERE Match.PlaceID < Master.PlaceID
    ;
QUIT;
Adding the where clause:
where match.PlaceID < master.PlaceID
That will return 449,985,000 records, i.e. (n^2 - n)/2 with n = 30,000. Hopefully that is small enough to handle.
(this takes 1:05 to run on a test table with 30k records on my laptop)
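From there, the per-place count has to credit each pair to both of its members, since the filtered table stores every pair only once. A sketch using the PlaceID2 column kept above (Near_Counts is a name invented here):
PROC SQL;
    CREATE TABLE Near_Counts AS
    SELECT PlaceID
          ,COUNT(*) AS Places_Within_30mi
    FROM (SELECT PlaceID FROM Distance_Table_1 WHERE DISTANCE <= 30
          UNION ALL
          SELECT PlaceID2 FROM Distance_Table_1 WHERE DISTANCE <= 30) /* credit both sides of each pair */
    GROUP BY PlaceID
    ;
QUIT;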
Upvotes: 1