Reputation: 435
I have PySpark dataframes with a couple of columns, one of them being a GPS location (in WKT format). What is the easiest way to pick only the rows that fall inside some polygon? Does it scale when there are ~1B rows?
I'm using Azure Databricks, and if the solution exists in Python, that would be even better, but Scala and SQL are also fine.
Edit: Alex Ott's answer - Mosaic - works and I find it easy to use.
Upvotes: 4
Views: 2967
Reputation: 87204
Databricks Labs includes the Mosaic project, a library for processing geospatial data that is heavily optimized for Databricks.
This library provides the st_contains & st_intersects functions, which can be used to find rows that are inside your polygons or other objects. These functions are available in all supported languages - Scala, SQL, Python, and R. For example, in SQL:
SELECT st_contains("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
                   "POINT (25 15)")
Upvotes: 2
Reputation: 1
OpenAI says: I think you can use the ST_Contains function.
import pyspark.sql.functions as F
# Requires a library that registers the ST_* SQL functions on the Spark session (e.g. Apache Sedona);
# the gps column holds WKT, so it is parsed with ST_GeomFromText as well.
df.withColumn("is_inside", F.expr("ST_Contains(ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'), ST_GeomFromText(gps))")).where("is_inside").show()
Upvotes: -1