Reputation: 6199
I am developing an app that takes a road name from a user, looks for all the roads with similar names in North America, and then returns a plot of all those roads. My goal is to use duckdb for querying the large spatial files (.gpkg) of linestrings to get the sf data as fast as possible.
I downloaded the Canada .pbf file from the Geofabrik website. Reading the linestrings layer created a new .gpkg file in the same location:
library(osmextract)
osm_lines <- oe_read("data/geofabrik_canada-latest.osm.pbf",
layer = "lines")
Then I used the duckdbfs and duckplyr packages for querying:
library(duckplyr)
system.time({
foo <- duckdbfs::open_dataset(here::here("data/geofabrik_canada-latest.gpkg")) |>
select(name, highway, geometry)
})
# user system elapsed
# 1.11 0.21 0.88
road_name <- "Ontario"
num_of_rows <- 5
system.time({
foo2 <- foo |>
filter(stringr::str_detect(name, road_name)) |>
slice_sample(n = num_of_rows)
})
# user system elapsed
# 0.01 0.00 0.01
foo2 |>
select(name, highway)
system.time({
fooz <- foo2 |>
duckdbfs::to_sf()
})
# user system elapsed
# 12.83 9.34 7.50
Is there a faster way to query than this?
Upvotes: 0
Views: 168
Reputation: 17544
There are a few aspects that make this a rather interesting and perhaps somewhat confusing topic.
We have:
- duckplyr, which translates dplyr verbs to DuckDB's relational API, skipping SQL and dbplyr; but it does attach dplyr, so all verbs appear to work.
- duckdbfs, which returns lazy dplyr::tbl objects that are translated to SQL by dbplyr and materialized at some point.
So is duckplyr actually used here?
As we are starting with a lazy dplyr::tbl returned by duckdbfs::open_dataset(), duckplyr just stands by, and all dplyr verbs (select, filter, slice_sample) and other implemented functions (str_detect) are handled and translated to SQL by dbplyr instead. Which kind of works in your favor, as translations for str_detect() and slice() & friends are not yet implemented in duckplyr (in-dev doc: [1], [2]), so it would need to fall back to dplyr anyway, which in turn would pull all records from duckdb, which here happens to act as a proxy for the complete gpkg dataset, into R.
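One way to check which backend will handle a pipeline is to inspect the class of the object; a minimal sketch on a throwaway in-memory duckdb instance (table `t` is made up for illustration):

```r
library(dplyr)

# a throwaway in-memory duckdb connection with a toy table
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "t", data.frame(x = 1:3))

# tbl() on a DBI connection returns a lazy tbl: verbs on it are
# translated to SQL by dbplyr, duckplyr never gets involved
lazy_tbl <- tbl(con, "t")
inherits(lazy_tbl, "tbl_lazy")
#> [1] TRUE

DBI::dbDisconnect(con, shutdown = TRUE)
```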
To optimize your current approach a bit, we could pre-filter that dataset by name & highway values (as you only query non-null values, that's 1.6M out of 16.6M total records) and store the result in a persistent DuckDB database.
library(dplyr)
library(duckdbfs)
lines_gpkg <- "D:/geofabrik/geofabrik_canada-latest.gpkg"
(lines_duckdb <- fs::path_ext_set(lines_gpkg, "duckdb"))
#> D:/geofabrik/geofabrik_canada-latest.duckdb
# switch duckdb from default in-memory mode to a persistent file mode
con <- duckdbfs::cached_connection(lines_duckdb)
# open_dataset creates view to access gpkg;
# select only required columns and records where both `name` & `highway` are set
# materialize result as a new `named_hw` table in duckdb, create index for `name`;
# how useful the index is depends on how and what you are matching,
# even when not using exact matches; e.g. DuckDB is smart enough to
# turn something like `REGEXP_MATCHES(name, '^Ontario')` into
# `prefix(name, 'Ontario')`, which definitely gains from the index
named_highways <-
open_dataset(
sources = lines_gpkg,
con = duckdbfs::cached_connection(lines_duckdb),
tblname = "lines_view"
) |>
select(name, highway, geometry) |>
filter(!is.na(name), !is.na(highway)) |>
compute(name = "named_hw", indexes = list("name"), temporary = FALSE, overwrite = TRUE)
close_connection()
# compare sizes of .gpkg and resulting .duckdb
fs::file_info(c(lines_gpkg, lines_duckdb))[,1:3]
#> # A tibble: 2 × 3
#> path type size
#> <fs::path> <fct> <fs::bytes>
#> 1 D:/geofabrik/geofabrik_canada-latest.gpkg file 7.22G
#> 2 D:/geofabrik/geofabrik_canada-latest.duckdb file 591.51M
And in your next R sessions, use that newly created duckdb database for queries:
library(dplyr)
library(duckdbfs)
lines_duckdb <- "D:/geofabrik/geofabrik_canada-latest.duckdb"
# as we'll use duckdbfs::to_sf(), we might as well use
# duckdbfs::cached_connection() in place of DBI::dbConnect(duckdb::duckdb(dbdir = lines_duckdb))
named_highways <- tbl(cached_connection(lines_duckdb), "named_hw")
# cold start, next queries are faster
system.time({
highway_sample_sf <-
named_highways |>
filter(str_detect(name, "Ontario")) |>
slice_sample(n = 5) |>
to_sf(crs = "WGS84")
})
#> user system elapsed
#> 0.92 0.64 0.70
highway_sample_sf
#> Simple feature collection with 5 features and 2 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: -81.76428 ymin: 43.21599 xmax: -73.53746 ymax: 45.55634
#> Geodetic CRS: WGS 84
#> name highway geom
#> 1 Rue Ontario Est footway LINESTRING (-73.53746 45.55...
#> 2 Ontario Street South primary LINESTRING (-81.76428 43.30...
#> 3 Place Ontario footway LINESTRING (-73.58051 45.49...
#> 4 Ontario Street residential LINESTRING (-78.67769 43.90...
#> 5 Ontario Street residential LINESTRING (-81.30096 43.21...
While it would work with your example queries, whether throwing away most of the data like this works for your final application is another question. And if you do plan to use a similar sampling approach, you have probably noticed by now that individual lines are mostly just small sections of streets and other highway-type features like hiking trails.
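If you need whole streets rather than sections, one option (a sketch, assuming the `named_hw` table created above, and assuming segments sharing a name belong together) is to pull all matching segments into R and let sf union them per name:

```r
library(dplyr)
library(duckdbfs)

lines_duckdb <- "D:/geofabrik/geofabrik_canada-latest.duckdb"

streets <- tbl(cached_connection(lines_duckdb), "named_hw") |>
  filter(str_detect(name, "Ontario")) |>
  to_sf(crs = "WGS84") |>
  group_by(name) |>
  # sf's summarise() unions member geometries, yielding one
  # (MULTI)LINESTRING per distinct street name
  summarise()
```

Note that grouping by name alone will merge same-named streets from different cities into one feature, so in practice you would likely also group by some spatial or administrative attribute.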
To get a better idea why it originally was so slow, we can enable DuckDB query logging and profiling to trace what duckdbfs::open_dataset() actually does (though it's all well documented) and how materialization shows up in a query plan.
With just the standard show_query() & explain() we would not be able to capture queries that are executed within duckdbfs functions.
Queries are logged to the ddb_log file and the last query plan is stored in the ddb_prof file.
con <- duckdbfs::cached_connection()
ddb_log <- fs::file_temp("duckdb", ext = ".log")
ddb_prof <- fs::file_temp("duckdb", ext = ".profile")
stringr::str_glue(
"PRAGMA enable_profiling = 'query_tree';
PRAGMA profiling_output = '{ddb_prof}';
PRAGMA log_query_path = '{ddb_log}';") |>
DBI::dbExecute(conn = con, statement = _)
SQL in comments is taken from the ddb_log file, to show the actual queries that were performed in DuckDB.
library(duckplyr)
#> Loading required package: dplyr
#> /../
lines_gpkg <- "D:/geofabrik/geofabrik_canada-latest.gpkg"
# open_dataset loads the duckdb spatial extension,
# creates a view to read the gpkg and returns a lazy tbl;
# no actual materialization takes place,
# the use of the select verb is just recorded and at some point
# will be used to generate SQL
system.time({
foo <- duckdbfs::open_dataset(lines_gpkg) |>
select(name, highway, geometry)
})
#> user system elapsed
#> 0.41 0.46 1.90
# relevant queries from logs made by open_dataset:
# LOAD 'spatial';
# CREATE OR REPLACE TEMPORARY VIEW mdivdyhqkejpsxv AS SELECT * FROM st_read('D:/geofabrik/geofabrik_canada-latest.gpkg');
road_name <- "Ontario"
num_of_rows <- 5
# no queries are made, just tracking the use of verbs (filter, slice_sample)
# and str_detect function
system.time({
foo2 <- foo |>
filter(stringr::str_detect(name, road_name)) |>
slice_sample(n = num_of_rows)
})
#> user system elapsed
#> 0.02 0.00 0.05
# print method triggers partial materialization:
# dbplyr translates known functions and
# builds, optimises, renders verbs in lazy tbl to SQL,
# a small subset (LIMIT 11) is fetched
system.time({
foo2 |>
select(name, highway) |>
print()
})
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [margus@Windows 10 x64:R 4.4.2/:memory:]
#> name highway
#> <chr> <chr>
#> 1 Ontario Street residential
#> 2 Ontario Street secondary
#> 3 Ontario Street residential
#> 4 Rue Ontario Est footway
#> 5 Rue Ontario Est footway
#> user system elapsed
#> 7.29 10.00 6.44
# SELECT "name", highway
# FROM (
# SELECT
# "name",
# highway,
# geometry,
# ROW_NUMBER() OVER (ORDER BY RANDOM()) AS col01
# FROM mdivdyhqkejpsxv
# WHERE (REGEXP_MATCHES("name", 'Ontario'))
# ) q01
# WHERE (col01 <= 5)
# LIMIT 11
# full materialization in to_sf, load - subset - sample
system.time({
fooz <- foo2 |>
duckdbfs::to_sf()
})
#> user system elapsed
#> 17.49 12.82 8.68
# SELECT "name", highway, ST_AsWKB(geom) AS geom
# FROM (
# SELECT "name", highway, geometry AS geom
# FROM (
# SELECT
# "name",
# highway,
# geometry,
# ROW_NUMBER() OVER (ORDER BY RANDOM()) AS col01
# FROM mdivdyhqkejpsxv
# WHERE (REGEXP_MATCHES("name", 'Ontario'))
# ) q01
# WHERE (col01 <= 5)
# ) q01
So duckdbfs::open_dataset() by default does not really load anything from the GPKG; it just loads the spatial extension in the DuckDB in-memory instance, creates a view to access the GPKG file and returns a lazy tbl pointing to that view. Computation starts when something needs to be pulled into R: for example, when you trigger the print method (e.g. with foo2 |> select(name, highway)) and a small subset gets fetched, or when you call to_sf().
The whole process (load - filter - sample) is restarted every time you (re-)trigger materialization of foo2; and loading here is the costly bit.
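If you do need to materialize the same lazy result more than once in a session, one option (a sketch) is to cache it with dbplyr's compute(), which runs the query once into a temporary DuckDB table; later pulls then read from that table instead of re-triggering ST_READ:

```r
# runs the full load - filter - sample pipeline exactly once,
# storing the result in a temporary table in the duckdb instance
foo2_cached <- foo2 |> compute()

# subsequent materializations are cheap reads from that table
foo2_cached |> duckdbfs::to_sf()
```

This also freezes the random sample, so repeated pulls return the same rows.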
One thing to note here is how dbplyr translated str_detect to REGEXP_MATCHES, which DuckDB then optimizes to match the pattern used: here it switches to a plain text search with contains(); with "^Ontario" it would use prefix(), for example.
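The dbplyr side of that translation can be inspected with show_query(); a minimal sketch on a toy table, independent of the gpkg data:

```r
library(dplyr)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "streets", data.frame(name = "Ontario Street"))

tbl(con, "streets") |>
  filter(stringr::str_detect(name, "^Ontario")) |>
  show_query()
# the rendered SQL filters with REGEXP_MATCHES("name", '^Ontario');
# the rewrite to prefix()/contains() then happens inside DuckDB's
# optimizer, not in dbplyr

DBI::dbDisconnect(con, shutdown = TRUE)
```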
The slowest step is TABLE_SCAN / ST_READ, i.e. loading. 65.44s might seem somewhat misleading, but it's the cumulative CPU time for all (8) threads combined, so it more or less aligns with TotalTime x 8:
readr::read_file(ddb_prof) |> cat()
#> ┌─────────────────────────────────────┐
#> │┌───────────────────────────────────┐│
#> ││ Query Profiling Information ││
#> │└───────────────────────────────────┘│
#> └─────────────────────────────────────┘
#> SELECT "name", highway, ST_AsWKB(geom) AS geom FROM ( SELECT "name", highway, geometry AS geom FROM ( SELECT "name", highway, geometry, ROW_NUMBER() OVER (ORDER BY RANDOM()) AS col01 FROM ukcvyahgwidimrk WHERE (REGEXP_MATCHES("name", 'Ontario')) ) q01 WHERE (col01 <= 5) ) q01
#> ┌────────────────────────────────────────────────┐
#> │┌──────────────────────────────────────────────┐│
#> ││ Total Time: 8.27s ││
#> │└──────────────────────────────────────────────┘│
#> └────────────────────────────────────────────────┘
#> ┌───────────────────────────┐
#> │ QUERY │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ name │
#> │ highway │
#> │ geom │
#> │ │
#> │ 5 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ name │
#> │ highway │
#> │ geometry │
#> │ │
#> │ 5 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ──────────────────── │
#> │ (col01 <= 5) │
#> │ │
#> │ 5 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ #0 │
#> │ #1 │
#> │ #2 │
#> │ #3 │
#> │ │
#> │ 1282 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ WINDOW │
#> │ ──────────────────── │
#> │ Projections: │
#> │ ROW_NUMBER() OVER (ORDER │
#> │ BY random() ASC NULLS │
#> │ LAST) │
#> │ │
#> │ 1282 Rows │
#> │ (0.01s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ──────────────────── │
#> │ contains(name, 'Ontario') │
#> │ │
#> │ 1282 Rows │
#> │ (0.19s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ TABLE_SCAN │
#> │ ──────────────────── │
#> │ Function: ST_READ │
#> │ │
#> │ Projections: │
#> │ name │
#> │ highway │
#> │ geometry │
#> │ │
#> │ 16643024 Rows │
#> │ (65.44s) │
#> └───────────────────────────┘
Upvotes: 2