Kevin White
Kevin White

Reputation: 173

best way to store a symmetric matrix of data nxn = 2.6 billion

i have postgresql with the postgis extension installed and a data table of zipcodes with lat/long as a point field. i wish to return zips within a variable distance of some zip, like

return all zips within x miles of zip 12345

there are about 51,000 zipcodes. precomputing all would allow for lookups without computation. right now i'm doing comps on-the-fly. the computed data could be arranged in a symmetric matrix.

i was thinking of this solution:

if we accept that the distance of a zip from itself is implied to be zero, then i could load a table with n^2/2-n rows (about 1.3 billion rows), with columns z1 z2 d, and then do a compound index on z1+d to return my queryset containing the list of z2.

my question is how would you handle it for efficient return on-the-fly. possibly abandon sql after all distance calculations? leave it how i have it doing comps at query time? i don't care too much about complete distance computation time or indexing time. i'd do these annually, or at most quarterly. storage might also be a concern?

Upvotes: 2

Views: 869

Answers (3)

John Powell
John Powell

Reputation: 12571

Postgres/PostGIS spatial indexes are designed to do exactly this kind of search. They are based on R-trees, http://en.wikipedia.org/wiki/R_tree, which essentially subdivide your spatial data into boxes, ie, it is a 2-dimensional. There is a function, ST_DWithin, which will return all the geometries within distance x, of some other geometry. So, given a table of zip codes and points (called geom) representing lat/long locations, you can write queries, such as,

select zip, geom from zipcodes z, 
  (select geom from zipcodes where zip=12345) s 
where ST_DWithin(s.geom, z.geom, 10000)
  order by ST_Distance(s.geom, z.geom) limit 5;

which will return the nearest 5 zip codes within 10km of the zip code 12345.

As you can index both the zip code and the geometry field very efficiently, it would be unnecesary, imho, to store a matrix of all the possible distances, as spatial indexes perform well with tens of millions of rows.

Creating a spatial index in Posgis is as easy as;

create index ix_spatial_zips on zipcodes using gist(geom);

I realize that this doesn't answer you original question exactly, but this means you will only need to store 51,000 rows, rather than the cartesian product of that number, and the performance will be better too.

Upvotes: 1

Joe Love
Joe Love

Reputation: 5932

Have you considered using EarthDistance? Within it, you can index "boxes" which are areas that basically "square off" your search area instead of it being round, so it can be indexed easier.. then, within your query, you also include a "radius" type query that eliminates the extra results returned using the box method.

http://www.postgresql.org/docs/9.2/static/earthdistance.html

Upvotes: 1

Thorsten Kettner
Thorsten Kettner

Reputation: 94884

That's an interesting question. I think an rdbms perfect for this task. No need to abandon it.

As to storing pre-computed distances: I would only do this if really needed, i.e. if you have performance issues. After all it's redundant data that must be maintained. If you decide for such a table, I agree with Vesper; store all n^2 rows, for otherwise you will always have to combine two queries; one to look up your zip code in z1, one to look it up in z2.

But maybe you can speed up your existing query. I don't know how you went about it. I remember the formula for distances to be quite complicated. So what I would do is to calculate the extreme latitudes and longitudes being within the desired range first (i.e. if I stay in the same latitude, what are the minimum and maximum longitudes still in that range; if I stay in the same longitude, what are the minimum and maximum latitudes). With the values calculated you can select all zip codes in that rectangle with BETWEEN (so indexes on longitude and latitude might come handy) and then only use the exact formula on the records thus found.

EDIT: I have given it more thought. If this database only exists for the task you describe, then yes, why not have another table for this particular purpose. You are right to mention storage. This table will need several GB and the index will take a lot of space, too. But with enough hard disk space available, this should be no problem.

Upvotes: 1

Related Questions