Reputation: 6199
I am developing an app that takes a road name from a user, looks for all the roads with similar names in North America, and then returns a plot of all those roads. My goal is to use duckdb for querying the large spatial files (.gpkg) of linestrings to get the sf data as fast as possible.
I downloaded the Canada .pbf file from the Geofabrik website. Reading the linestrings layer created a new .gpkg file in the same location:
library(osmextract)
osm_lines <- oe_read("data/geofabrik_canada-latest.osm.pbf",
layer = "lines")
Then I used the duckdbfs and duckplyr packages for querying:
library(duckplyr)
system.time({
foo <- duckdbfs::open_dataset(here::here("data/geofabrik_canada-latest.gpkg")) |>
select(name, highway, geometry)
})
# user system elapsed
# 1.11 0.21 0.88
road_name <- "Ontario"
num_of_rows <- 5
system.time({
foo2 <- foo |>
filter(stringr::str_detect(name, road_name)) |>
slice_sample(n = num_of_rows)
})
# user system elapsed
# 0.01 0.00 0.01
foo2 |>
select(name, highway)
system.time({
fooz <- foo2 |>
duckdbfs::to_sf()
})
# user system elapsed
# 12.83 9.34 7.50
Is there a faster way to query than this?
Upvotes: 0
Views: 168
Reputation: 17544
There are a few aspects that make this a rather interesting and perhaps somewhat confusing topic.
We have:
- duckplyr, which translates dplyr verbs to DuckDB's relational API, skipping SQL and dbplyr; but it does attach dplyr, so all verbs appear to work.
- duckdbfs, which returns lazy dplyr::tbl objects that are translated to SQL by dbplyr and materialized at some point.
So is duckplyr actually used here?
As we are starting with a lazy dplyr::tbl returned by duckdbfs::open_dataset(), duckplyr just stands by, and all dplyr verbs (select, filter, slice_sample) and other implemented functions (str_detect) are handled and translated to SQL by dbplyr instead. Which kind of works in your favor, as translations for str_detect() and slice() & friends are not yet implemented in duckplyr (in-dev doc: [1], [2]), so it would need to fall back to dplyr anyway, which in turn would pull all records from duckdb, which here happens to act as a proxy for the complete gpkg dataset, into R.
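One way to check which backend will handle a pipeline is to inspect the class of the object; a minimal sketch on a throwaway in-memory duckdb instance (table `t` is made up for illustration):

```r
library(dplyr)

# a throwaway in-memory duckdb connection with a toy table
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "t", data.frame(x = 1:3))

# tbl() on a DBI connection returns a lazy tbl: verbs on it are
# translated to SQL by dbplyr, duckplyr never gets involved
lazy_tbl <- tbl(con, "t")
inherits(lazy_tbl, "tbl_lazy")
#> [1] TRUE

DBI::dbDisconnect(con, shutdown = TRUE)
```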
To optimize your current approach a bit, we could pre-filter that dataset by name & highway values (as you only query non-null values, that's 1.6M out of 16.6M total records) and store the result in a persistent DuckDB database.
library(dplyr)
library(duckdbfs)
lines_gpkg <- "D:/geofabrik/geofabrik_canada-latest.gpkg"
(lines_duckdb <- fs::path_ext_set(lines_gpkg, "duckdb"))
#> D:/geofabrik/geofabrik_canada-latest.duckdb
# switch duckdb from default in-memory mode to a persistent file mode
con <- duckdbfs::cached_connection(lines_duckdb)
# open_dataset creates view to access gpkg;
# select only required columns and records where both `name` & `highway` are set
# materialize result as a new `named_hw` table in duckdb, create index for `name`;
# how useful the index is depends on how and what you are matching,
# even when not using exact matches; e.g. DuckDB is smart enough to
# turn something like `REGEXP_MATCHES(name, '^Ontario')` into
# `prefix(name, 'Ontario')`, which definitely gains from the index
named_highways <-
open_dataset(
sources = lines_gpkg,
con = duckdbfs::cached_connection(lines_duckdb),
tblname = "lines_view"
) |>
select(name, highway, geometry) |>
filter(!is.na(name), !is.na(highway)) |>
compute(name = "named_hw", indexes = list("name"), temporary = FALSE, overwrite = TRUE)
close_connection()
# compare sizes of .gpkg and resulting .duckdb
fs::file_info(c(lines_gpkg, lines_duckdb))[,1:3]
#> # A tibble: 2 × 3
#> path type size
#> <fs::path> <fct> <fs::bytes>
#> 1 D:/geofabrik/geofabrik_canada-latest.gpkg file 7.22G
#> 2 D:/geofabrik/geofabrik_canada-latest.duckdb file 591.51M
And in your next R sessions, use that newly created duckdb database for queries:
library(dplyr)
library(duckdbfs)
lines_duckdb <- "D:/geofabrik/geofabrik_canada-latest.duckdb"
# as we'll use duckdbfs::to_sf(), we might as well use
# duckdbfs::cached_connection() in place of DBI::dbConnect(duckdb::duckdb(dbdir = lines_duckdb))
named_highways <- tbl(cached_connection(lines_duckdb), "named_hw")
# cold start, next queries are faster
system.time({
highway_sample_sf <-
named_highways |>
filter(str_detect(name, "Ontario")) |>
slice_sample(n = 5) |>
to_sf(crs = "WGS84")
})
#> user system elapsed
#> 0.92 0.64 0.70
highway_sample_sf
#> Simple feature collection with 5 features and 2 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: -81.76428 ymin: 43.21599 xmax: -73.53746 ymax: 45.55634
#> Geodetic CRS: WGS 84
#> name highway geom
#> 1 Rue Ontario Est footway LINESTRING (-73.53746 45.55...
#> 2 Ontario Street South primary LINESTRING (-81.76428 43.30...
#> 3 Place Ontario footway LINESTRING (-73.58051 45.49...
#> 4 Ontario Street residential LINESTRING (-78.67769 43.90...
#> 5 Ontario Street residential LINESTRING (-81.30096 43.21...
While it would work with your example queries, whether throwing away most of the data like this works for your final application is another question. And if you do plan to use a similar sampling approach, you have probably noticed by now that individual lines are mostly just small sections of streets and other highway-type features like hiking trails.
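If you need whole streets rather than sections, one option (a sketch, assuming the `named_hw` table created above, and assuming segments sharing a name belong together) is to pull all matching segments into R and let sf union them per name:

```r
library(dplyr)
library(duckdbfs)

lines_duckdb <- "D:/geofabrik/geofabrik_canada-latest.duckdb"

streets <- tbl(cached_connection(lines_duckdb), "named_hw") |>
  filter(str_detect(name, "Ontario")) |>
  to_sf(crs = "WGS84") |>
  group_by(name) |>
  # sf's summarise() unions member geometries, yielding one
  # (MULTI)LINESTRING per distinct street name
  summarise()
```

Note that grouping by name alone will merge same-named streets from different cities into one feature, so in practice you would likely also group by some spatial or administrative attribute.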
To get a better idea why it originally was so slow, we can enable DuckDB query logging and profiling to trace what duckdbfs::open_dataset() actually does (though it's all well documented) and how materialization shows up in a query plan.
With just the standard show_query() & explain() we would not be able to capture queries that are executed within duckdbfs functions.
Queries are logged to the ddb_log file and the last query plan is stored in the ddb_prof file.
con <- duckdbfs::cached_connection()
ddb_log <- fs::file_temp("duckdb", ext = ".log")
ddb_prof <- fs::file_temp("duckdb", ext = ".profile")
stringr::str_glue(
"PRAGMA enable_profiling = 'query_tree';
PRAGMA profiling_output = '{ddb_prof}';
PRAGMA log_query_path = '{ddb_log}';") |>
DBI::dbExecute(conn = con, statement = _)
SQL in comments is taken from the ddb_log file, to show the actual queries that were performed in DuckDB.
library(duckplyr)
#> Loading required package: dplyr
#> /../
lines_gpkg <- "D:/geofabrik/geofabrik_canada-latest.gpkg"
# open_dataset loads the duckdb spatial extension,
# creates a view to read the gpkg and returns a lazy tbl;
# no actual materialization takes place,
# the use of the select verb is just recorded and at some point
# will be used to generate SQL
system.time({
foo <- duckdbfs::open_dataset(lines_gpkg) |>
select(name, highway, geometry)
})
#> user system elapsed
#> 0.41 0.46 1.90
# relevant queries from logs made by open_dataset:
# LOAD 'spatial';
# CREATE OR REPLACE TEMPORARY VIEW mdivdyhqkejpsxv AS SELECT * FROM st_read('D:/geofabrik/geofabrik_canada-latest.gpkg');
road_name <- "Ontario"
num_of_rows <- 5
# no queries are made, just tracking the use of verbs (filter, slice_sample)
# and str_detect function
system.time({
foo2 <- foo |>
filter(stringr::str_detect(name, road_name)) |>
slice_sample(n = num_of_rows)
})
#> user system elapsed
#> 0.02 0.00 0.05
# print method triggers partial materialization:
# dbplyr translates known functions and
# builds, optimises, renders verbs in lazy tbl to SQL,
# a small subset (LIMIT 11) is fetched
system.time({
foo2 |>
select(name, highway) |>
print()
})
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [margus@Windows 10 x64:R 4.4.2/:memory:]
#> name highway
#> <chr> <chr>
#> 1 Ontario Street residential
#> 2 Ontario Street secondary
#> 3 Ontario Street residential
#> 4 Rue Ontario Est footway
#> 5 Rue Ontario Est footway
#> user system elapsed
#> 7.29 10.00 6.44
# SELECT "name", highway
# FROM (
# SELECT
# "name",
# highway,
# geometry,
# ROW_NUMBER() OVER (ORDER BY RANDOM()) AS col01
# FROM mdivdyhqkejpsxv
# WHERE (REGEXP_MATCHES("name", 'Ontario'))
# ) q01
# WHERE (col01 <= 5)
# LIMIT 11
# full materialization in to_sf, load - subset - sample
system.time({
fooz <- foo2 |>
duckdbfs::to_sf()
})
#> user system elapsed
#> 17.49 12.82 8.68
# SELECT "name", highway, ST_AsWKB(geom) AS geom
# FROM (
# SELECT "name", highway, geometry AS geom
# FROM (
# SELECT
# "name",
# highway,
# geometry,
# ROW_NUMBER() OVER (ORDER BY RANDOM()) AS col01
# FROM mdivdyhqkejpsxv
# WHERE (REGEXP_MATCHES("name", 'Ontario'))
# ) q01
# WHERE (col01 <= 5)
# ) q01
So duckdbfs::open_dataset() by default does not really load anything from the GPKG; it just loads the spatial extension in the DuckDB in-memory instance, creates a view to access the GPKG file and returns a lazy tbl pointing to that view. Computation starts when something needs to be pulled into R: for example, when you trigger the print method (e.g. with foo2 |> select(name, highway)) and a small subset gets fetched, or when you call to_sf().
The whole process (load - filter - sample) is restarted every time you (re-)trigger materialization of foo2; and loading here is the costly bit.
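If you do need to materialize the same lazy result more than once in a session, one option (a sketch) is to cache it with dbplyr's compute(), which runs the query once into a temporary DuckDB table; later pulls then read from that table instead of re-triggering ST_READ:

```r
# runs the full load - filter - sample pipeline exactly once,
# storing the result in a temporary table in the duckdb instance
foo2_cached <- foo2 |> compute()

# subsequent materializations are cheap reads from that table
foo2_cached |> duckdbfs::to_sf()
```

This also freezes the random sample, so repeated pulls return the same rows.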
One thing to note here is how dbplyr translated str_detect to REGEXP_MATCHES, which DuckDB then optimizes to match the pattern used: here it switches to a plain text search with contains(); with "^Ontario" it would use prefix(), for example.
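The dbplyr side of that translation can be inspected with show_query(); a minimal sketch on a toy table, independent of the gpkg data:

```r
library(dplyr)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "streets", data.frame(name = "Ontario Street"))

tbl(con, "streets") |>
  filter(stringr::str_detect(name, "^Ontario")) |>
  show_query()
# the rendered SQL filters with REGEXP_MATCHES("name", '^Ontario');
# the rewrite to prefix()/contains() then happens inside DuckDB's
# optimizer, not in dbplyr

DBI::dbDisconnect(con, shutdown = TRUE)
```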
The slowest step is TABLE_SCAN / ST_READ, i.e. loading. 65.44s might seem somewhat misleading, but it's the cumulative CPU time for all (8) threads combined, so it more or less aligns with TotalTime x 8:
readr::read_file(ddb_prof) |> cat()
#> ┌─────────────────────────────────────┐
#> │┌───────────────────────────────────┐│
#> ││ Query Profiling Information ││
#> │└───────────────────────────────────┘│
#> └─────────────────────────────────────┘
#> SELECT "name", highway, ST_AsWKB(geom) AS geom FROM ( SELECT "name", highway, geometry AS geom FROM ( SELECT "name", highway, geometry, ROW_NUMBER() OVER (ORDER BY RANDOM()) AS col01 FROM ukcvyahgwidimrk WHERE (REGEXP_MATCHES("name", 'Ontario')) ) q01 WHERE (col01 <= 5) ) q01
#> ┌────────────────────────────────────────────────┐
#> │┌──────────────────────────────────────────────┐│
#> ││ Total Time: 8.27s ││
#> │└──────────────────────────────────────────────┘│
#> └────────────────────────────────────────────────┘
#> ┌───────────────────────────┐
#> │ QUERY │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ name │
#> │ highway │
#> │ geom │
#> │ │
#> │ 5 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ name │
#> │ highway │
#> │ geometry │
#> │ │
#> │ 5 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ──────────────────── │
#> │ (col01 <= 5) │
#> │ │
#> │ 5 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ #0 │
#> │ #1 │
#> │ #2 │
#> │ #3 │
#> │ │
#> │ 1282 Rows │
#> │ (0.00s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ WINDOW │
#> │ ──────────────────── │
#> │ Projections: │
#> │ ROW_NUMBER() OVER (ORDER │
#> │ BY random() ASC NULLS │
#> │ LAST) │
#> │ │
#> │ 1282 Rows │
#> │ (0.01s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ──────────────────── │
#> │ contains(name, 'Ontario') │
#> │ │
#> │ 1282 Rows │
#> │ (0.19s) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ TABLE_SCAN │
#> │ ──────────────────── │
#> │ Function: ST_READ │
#> │ │
#> │ Projections: │
#> │ name │
#> │ highway │
#> │ geometry │
#> │ │
#> │ 16643024 Rows │
#> │ (65.44s) │
#> └───────────────────────────┘
Upvotes: 2