Efficiently compute distance between two POINTS using geodesic and polars for large scale datasets

Question

I have created a dummy sample of ~10_000 rows with two columns (pikcup and dropoff location points - encoded as string).

I read the sample into a polars Dataframe with the following command:

df = pl.read_csv("./taxi_coordinates.csv")

I would like to efficiently compute the distance between those points using the module from geopy.distance import geodesic

Please note that I am trying to discover the most efficient approach because my original sample is over 30 million rows.

My approach using map_rows()

def compute_coordinates_v2(df:pl.DataFrame, col:str) -> pl.DataFrame:
    target_col:str = 'pu_polygon_centroid' if col == 'pickup' else 'do_polygon_centroid'
    location_data:str = f'{col}_location_cleaned'
    coordinates:str = f'{col}_coordinates'
    df = df.with_columns(
        pl.col(target_col).str.replace_all(r'POINT $|$', '').alias(location_data)
    ).with_columns(
        pl.col(location_data).str.split(' ').alias(coordinates)
    )
    return df

df = compute_coordinates_v2(df, 'pickup')
df = compute_coordinates_v2(df, 'dropoff')

The above operation will generate two columns of list type

shape: (5, 1)
┌───────────────────────────────────┐
│ pickup_coordinates                │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9656345353807", "40.768615… │
│ ["-73.9924375369761", "40.748497… │
└───────────────────────────────────┘
shape: (5, 1)
┌───────────────────────────────────┐
│ dropoff_coordinates               │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["-73.9656345353807", "40.768615… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9924375369761", "40.748497… │
│ ["-74.007879708664", "40.7177727… │
└───────────────────────────────────┘

Now to compute the distance I use the following func

def compute_centroid_distance_v2(row):
    if (row[0][0]) and (row[0][1]) and (row[1][0]) and (row[1][1]):
        centroid_distance = geodesic(
            (row[0][1], row[0][0]), #(latitude, longitude)
            (row[1][1], row[1][0])
        ).kilometers
    else:
        centroid_distance = 0.0
    return centroid_distance

df = df.with_columns(
        df.select(["pickup_coordinates", "dropoff_coordinates"]).map_rows(compute_centroid_distance_v2).rename({'map': "centroid_distance"})
    )

On a benchmark of 30 million rows the map_rows took approximately 1.5 hours.

Obviously something like

df = df.with_columns(
        pl.col("pickup_coordinates").list.first().cast(pl.Float32).alias('pickup_longitude'),
        pl.col("pickup_coordinates").list.last().cast(pl.Float32).alias('pickup_latitude'),
        pl.col("dropoff_coordinates").list.first().cast(pl.Float32).alias('dropoff_longitude'),
        pl.col("dropoff_coordinates").list.last().cast(pl.Float32).alias('dropoff_latitude')
    ).with_columns(
        coords = geodesic( (pl.col("pickup_latitude"), pl.col('pickup_longitude')),  (pl.col("dropoff_latitude"), pl.col('dropoff_longitude'))).kilometers
    )

didn't work because polars tries to apply a logical operation on (pl.col("pickup_latitude"), pl.col('pickup_longitude')

Thus, I would like to understand if map_rows/map_elements is my only solution or if there is a different work-around that could speed up the computations.

jqurious · Accepted Answer

As in the answer from https://stackoverflow.com/a/76265233/ you can attempt to replicate geodesic() using Polars Expressions.

Another potential option could be DuckDB, which can input/output Polars DataFrames easily.

DuckDB has a SPATIAL extension: https://duckdb.org/2023/04/28/spatial.html

duckdb.sql("install spatial") # needed once

If I increase your example for a simple comparison:

df_big = pl.concat([df] * 10)

Using your map_rows approach:

(df_big
  .select("pickup_coordinates", "dropoff_coordinates")
  .map_rows(compute_centroid_distance_v2)
  .rename({"map": "centroid_distance"})
)

shape: (98_510, 1)
┌───────────────────┐
│ centroid_distance │
│ ---               │
│ f64               │
╞═══════════════════╡
│ 1.50107           │
│ 0.0               │
│ 0.0               │
│ 3.18019           │
│ 3.652772          │
│ …                 │
│ 2.376629          │
│ 1.440797          │
│ 4.583181          │
│ 0.53954           │
│ 2.589928          │
└───────────────────┘

Elapsed time: 4.52725 seconds

Using duckdb:

duckdb.sql("load spatial")

duckdb.sql("""
from df_big
select
   st_distance_spheroid(
      st_point(
         pickup_coordinates[2]::float, -- NOTE: array indexing is 1-based
         pickup_coordinates[1]::float
      ),
      st_point(
         dropoff_coordinates[2]::float,
         dropoff_coordinates[1]::float
      )
   ) as geodesic
""") .pl()

shape: (98_510, 1)
┌─────────────┐
│ geodesic    │
│ ---         │
│ f64         │
╞═════════════╡
│ 1501.364    │
│ 0.0         │
│ 0.0         │
│ 3180.189287 │
│ 3652.673199 │
│ …           │
│ 2376.786018 │
│ 1440.740571 │
│ 4583.039701 │
│ 539.144276  │
│ 2590.085087 │
└─────────────┘

Elapsed time: 0.10821 seconds

I don't know too much about spatial data, so I'm not entirely sure why there are slight differences in the outputs.

It also seems like you could use st_read() to load the data directly into duckdb instead of having to manually chop it up with Polars first.

Efficiently compute distance between two POINTS using geodesic and polars for large scale datasets

Answers (2)

Related Questions