Reputation: 101
A very basic geospatial join, based on this example, times out every time.
The table polygons contains 340K polygons, while points contains 5K rows with latitude/longitude pairs (and an ID). Both are single .csv files in S3.
Query:
SELECT poly.geometry, p.id
FROM polygons as poly
CROSS JOIN points as p
WHERE ST_CONTAINS(ST_POLYGON(poly.geometry), ST_POINT(p.lon, p.lat));
The SQL query above never completes in the default 30-minute Athena query time limit.
I've found vanilla Athena queries on large-ish data sets are fairly performant, but I'm not sure whether the geospatial functionality can handle (million x million) joins, let alone (billion x million) joins.
Thought I'd ask before moving back to an EMR + Spark solution.
Upvotes: 1
Views: 592
Reputation: 712
Try using a pre-filter on the polygons table, like this:
... WHERE ST_Buffer(ST_GeometryFromText(shape_wkt), 5000.0 * 360.0 / (2.0 * pi() * cos(radians(latitud)) * 6400000.0))
5000.0 is the buffer radius in meters; the rest of the expression converts it to degrees of longitude at the given latitude (using an Earth radius of 6,400,000 m).
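The snippet above is elided, so the full query isn't shown. Below is a minimal sketch of one way to read it, adapted to the question's table and column names (polygons.geometry and points.lon/lat rather than shape_wkt/latitud): buffer each point by the 5 km radius converted to degrees at its latitude, and use that buffer as a coarse proximity filter ahead of the exact containment test. Whether this actually helps on Presto 0.172 is uncertain, since both predicates are still evaluated for every (polygon, point) pair.

-- Sketch only: coarse buffer pre-filter followed by the exact containment test.
SELECT poly.geometry, p.id
FROM polygons AS poly
CROSS JOIN points AS p
WHERE ST_INTERSECTS(
        ST_BUFFER(ST_POINT(p.lon, p.lat),
                  5000.0 * 360.0 / (2.0 * pi() * cos(radians(p.lat)) * 6400000.0)),
        ST_POLYGON(poly.geometry))
  AND ST_CONTAINS(ST_POLYGON(poly.geometry), ST_POINT(p.lon, p.lat));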
Upvotes: 0
Reputation: 20770
The Athena service is based on Presto 0.172. That version did not have a distributed geospatial join, so your query runs as a single-node cross join plus filter.
Newer Presto versions have that optimization, so they can execute such a query in a distributed manner.
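For example, on a recent Presto release the same logical query can be phrased as a join on the spatial predicate, which lets the planner choose a distributed spatial join. This is a sketch reusing the question's tables; whether the optimization actually kicks in depends on the Presto version and configuration.

-- Same logical query, written as an explicit join on ST_Contains so a
-- newer Presto planner can pick a distributed spatial join.
SELECT poly.geometry, p.id
FROM polygons AS poly
JOIN points AS p
  ON ST_Contains(ST_Polygon(poly.geometry), ST_Point(p.lon, p.lat));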
Besides going back to EMR, you can try Starburst Presto for AWS, which gives you an easy-to-launch, recent Presto version prepackaged for AWS. (Note: I work for Starburst.)
Upvotes: 3