Alcibiades

Reputation: 435

Running geospatial queries in PySpark in Databricks

I have PySpark dataframes with a couple of columns, one of them being a GPS location (in WKT format). What is the easiest way to pick only the rows that are inside some polygon? Does it scale when there are ~1B rows?

I'm using Azure Databricks, and if a solution exists in Python, that would be even better, but Scala and SQL are also fine.

Edit: Alex Ott's answer - Mosaic - works and I find it easy to use.

Upvotes: 4

Views: 2967

Answers (2)

Alex Ott

Reputation: 87204

Databricks Labs includes the Mosaic project, a library for processing geospatial data that is heavily optimized for Databricks.

This library provides the st_contains & st_intersects (doc) functions, which can be used to find rows that fall inside your polygons or other objects. These functions are available in all supported languages - Scala, SQL, Python, and R. For example, in SQL:

SELECT st_contains("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))", 
                   "POINT (25 15)")

Upvotes: 2

jdjd37

Reputation: 1

openai says:

I think you can use the ST_Contains function:

import pyspark.sql.functions as F

# ST_Contains / ST_GeomFromText must come from a geospatial library (e.g. Apache Sedona)
# registered on the cluster; gps is the WKT column, so it also needs to be parsed into a geometry.
df.withColumn(
    "is_inside",
    F.expr("ST_Contains(ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'), ST_GeomFromText(gps))")
).where("is_inside").show()

Upvotes: -1
