Reputation: 10222
I have a {targets} pipeline that should fill and use a DBI database (in this case duckdb; both SQL and duckdb were chosen for performance reasons).
How can I effectively use {targets} in combination with an SQL table?
A minimal reproducible example looks like this: it reads two datasets, combines them, fits a model, and creates a plot.
# in _targets.R
library(targets)
# preparation once:
# d <- ggplot2::mpg |> dplyr::mutate(id = 1:dplyr::n())
# write_csv(d |> dplyr::select(id, cty), "data1.csv")
# write_csv(d |> dplyr::select(id, displ, cyl), "data2.csv")
get_data <- function(file) {
  read_csv(file, col_types = cols()) |>
    as_tibble()
}

combine_data <- function(d1, d2) {
  left_join(d1, d2, by = "id")
}

fit_model <- function(data) {
  lm(cty ~ displ + cyl, data) |>
    coefficients()
}

plot_model <- function(model, data) {
  ggplot(data) +
    geom_point(aes(x = displ, y = cty, color = cyl)) +
    geom_abline(intercept = model[1], slope = model[2]) +
    theme_gray(24)
}
tar_option_set(packages = c("tibble", "readr", "dplyr", "ggplot2"))
list(
  tar_target(file1, "data1.csv", format = "file"),
  tar_target(file2, "data2.csv", format = "file"),
  tar_target(data1, get_data(file1)),
  tar_target(data2, get_data(file2)),
  tar_target(data, combine_data(data1, data2)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)
# run: tar_make() -> runs all targets
# change a value in data2.csv -> tar_make() -> rebuilds only the affected downstream targets
This should then be adapted to use a connection to a database:
# write to Database
con <- DBI::dbConnect(duckdb::duckdb(), "data.db")
on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
DBI::dbWriteTable(con, "data1", d |> dplyr::select(id, cty))
DBI::dbWriteTable(con, "data2", d |> dplyr::select(id, displ, cyl))
# in _targets.R... unclear at best...
list(
  tar_target(data1, ???), # if I use DBI::dbReadTable(con, "data1") it duplicates the data and returns it to R
)
The problem is that if I return the values of the table back to R, this might take too long. Is there a database-aware adapter for {targets}, and is this even possible/intended with {targets}?
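To make it concrete, here is a rough sketch of the kind of thing I am imagining (the function name and the summary are made up, and dbplyr would be needed for the lazy SQL translation): a target would carry only the table name, and downstream code would open its own connection, let duckdb do the work, and collect only the small result.
summarise_in_db <- function(table_name) {
  con <- DBI::dbConnect(duckdb::duckdb(), "data.db")
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  dplyr::tbl(con, table_name) |>                          # lazy table, no rows copied yet
    dplyr::summarise(mean_cty = mean(cty, na.rm = TRUE)) |>
    dplyr::collect()                                       # only the small result comes back to R
}
# in _targets.R:
# tar_target(data1_name, "data1"),
# tar_target(cty_summary, summarise_in_db(data1_name))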
Upvotes: 8
Views: 549
Reputation: 1084
I've experienced the same awkwardness and created sqltargets as a result. The pattern I now use is:
load_tables_to_db <- function(...) {
  # lst() (from tibble) keeps the names of the objects passed in; walk2() is from purrr
  tables <- tibble::lst(...)
  path_to_db <- fs::path("path_to_db")
  con <- DBI::dbConnect(
    duckdb::duckdb(),
    dbdir = path_to_db,
    read_only = FALSE
  )
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE))
  # write each data frame to a table named after the object it came from
  purrr::walk2(
    tables, names(tables),
    \(df, nm) DBI::dbWriteTable(conn = con, name = nm, value = df, overwrite = TRUE)
  )
}
And in _targets.R:
tar_target(
  tables_loaded_to_db,
  load_tables_to_db(
    table1,
    table2,
    ...
  )
),
tar_sql(report1, "query1.sql"),
tar_sql(report2, "query2.sql")
tar_sql will create an upstream target for the query file (e.g. 'report1_query_file'), while 'report1' contains the result of the query. The SQL file must contain a connection string in the header, and dependencies can be specified in comments (e.g. '--tar_load(table1)').
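For example, 'query1.sql' could look something like this (the header line below is only a placeholder for the connection string; see the package documentation for the exact syntax):
-- header: connection string goes here (exact format per the sqltargets docs)
--tar_load(table1)
select *
from table1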
The package is brand new and still a WIP, but may help smooth your workflow.
Upvotes: 6