Alcibiades

Reputation: 435

Running geospatial queries in PySpark in Databricks

I have PySpark dataframes with a couple of columns, one of them being a GPS location (in WKT format). What is the easiest way to pick only the rows that are inside some polygon? Does it scale when there are ~1B rows?

I'm using Azure Databricks, and if a solution exists in Python, that would be even better, but Scala and SQL are also fine.

Edit: Alex Ott's answer - Mosaic - works and I find it easy to use.

Upvotes: 4

Views: 2967

Answers (2)

Alex Ott

Reputation: 87204

Databricks Labs includes the Mosaic project, a library for processing geospatial data that is heavily optimized for Databricks.

This library provides the st_contains & st_intersects (doc) functions, which can be used to find rows that fall inside your polygons or other objects. These functions are available in all supported languages - Scala, SQL, Python, and R. For example, in SQL:

SELECT st_contains("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))", 
                   "POINT (25 15)")

Upvotes: 2

jdjd37

Reputation: 1

openai says:

I think you can use the ST_Contains function:

import pyspark.sql.functions as F

# ST_Contains / ST_GeomFromText must come from a geospatial library (e.g. Apache Sedona)
# registered on the cluster; gps is the WKT column, so it also needs to be parsed into a geometry.
df.withColumn(
    "is_inside",
    F.expr("ST_Contains(ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'), ST_GeomFromText(gps))")
).where("is_inside").show()

Upvotes: -1
