Reputation: 101
A very basic geospatial join, based on this example, times out every time.
The table polygons contains 340K polygons, while points contains 5K rows with latitude/longitude pairs (and an ID). Both are single .csv files in S3.
Query:
SELECT poly.geometry, p.id
FROM polygons as poly
CROSS JOIN points as p
WHERE ST_CONTAINS(ST_POLYGON(poly.geometry), ST_POINT(p.lon, p.lat));
The SQL query above never completes in the default 30-minute Athena query time limit.
I've found vanilla Athena queries on large-ish data sets are fairly performant, but I'm not sure whether the geospatial functionality can handle (million x million) joins, let alone (billion x million) joins.
Thought I'd ask before moving back to an EMR + Spark solution.
Upvotes: 1
Views: 592
Reputation: 712
Try using a pre-filter on the polygons table, like this:
... WHERE ST_Buffer(ST_GeometryFromText(shape_wkt), 5000.0 * 360.0 / (2.0 * pi() * cos(radians(latitud)) * 6400000.0))
5000.0 is the buffer radius in meters; the rest of the expression converts it to degrees of longitude at the given latitude (using an Earth radius of 6,400,000 m).
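The snippet above is elided, so the full query isn't shown. Below is a minimal sketch of one way to read it, adapted to the question's table and column names (polygons.geometry and points.lon/lat rather than shape_wkt/latitud): buffer each point by the 5 km radius converted to degrees at its latitude, and use that buffer as a coarse proximity filter ahead of the exact containment test. Whether this actually helps on Presto 0.172 is uncertain, since both predicates are still evaluated for every (polygon, point) pair.

-- Sketch only: coarse buffer pre-filter followed by the exact containment test.
SELECT poly.geometry, p.id
FROM polygons AS poly
CROSS JOIN points AS p
WHERE ST_INTERSECTS(
        ST_BUFFER(ST_POINT(p.lon, p.lat),
                  5000.0 * 360.0 / (2.0 * pi() * cos(radians(p.lat)) * 6400000.0)),
        ST_POLYGON(poly.geometry))
  AND ST_CONTAINS(ST_POLYGON(poly.geometry), ST_POINT(p.lon, p.lat));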
Upvotes: 0
Reputation: 20770
The Athena service is based on Presto 0.172. That version did not have a distributed geospatial join, so your query runs as a single-node cross join plus filter.
Newer Presto versions have that optimization, so they can execute such a query in a distributed manner.
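For example, on a recent Presto release the same logical query can be phrased as a join on the spatial predicate, which lets the planner choose a distributed spatial join. This is a sketch reusing the question's tables; whether the optimization actually kicks in depends on the Presto version and configuration.

-- Same logical query, written as an explicit join on ST_Contains so a
-- newer Presto planner can pick a distributed spatial join.
SELECT poly.geometry, p.id
FROM polygons AS poly
JOIN points AS p
  ON ST_Contains(ST_Polygon(poly.geometry), ST_Point(p.lon, p.lat));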
Besides going back to EMR, you can try Starburst Presto for AWS, which gives you an easy-to-launch, recent Presto version prepackaged for AWS. (Note: I work for Starburst.)
Upvotes: 3