jonathan
jonathan

Reputation: 101

Athena geospatial SQL joins never complete

A very basic geospatial join, based on this example, times out every time.

The table polygons contains 340K polygons, while points contains 5K rows with latitude/longitude pairs (and an ID). Both are single .csv files in S3.

Query:

SELECT poly.geometry, p.id
FROM polygons as poly
CROSS JOIN points as p
WHERE ST_CONTAINS (ST_POLYGON(poly.geometry), ST_POINT(p.lon, p.lat));

The SQL query above never completes in the default 30-minute Athena query time limit.

I've found vanilla Athena queries on large-ish data sets are fairly performant, but I'm not sure whether the geospatial functionality can handle (million x million) joins, let alone (billion x million) joins.

Thought I'd ask before moving back to EMR+Spark solution.

Upvotes: 1

Views: 592

Answers (2)

Try to use pre filter in polygons table like this

... where ST_Buffer(ST_GeometryFromText( shape_wkt) , 5000.0 * 360.0 / (2.0 * pi() * cos( radians(latitud) ) * 6400000.0) ))

5000.0 = radius distance in meters

Upvotes: 0

Piotr Findeisen
Piotr Findeisen

Reputation: 20770

Athena service is based on Presto 0.172. That version did not have geospatial join, so your query if single-node cross-join + filter.

Newer Presto versions have the optimization so they can execute such query in a distributed manner.

Besides going back to EMR, you can try Starburst Presto for AWS which gives you easy to launch best Presto version prepackaged for AWS. (note: I work for Starburst)

Upvotes: 3

Related Questions