David

Reputation: 10222

R Targets with DBI/SQL database

I have a targets pipeline which should fill and use a DBI database (in this case duckdb; both SQL and duckdb were chosen for performance reasons).

How can I effectively use {targets} in combination with an SQL table?

A minimal, reproducible example looks like this: it reads two datasets, combines them, fits a model, and creates a plot.

# in _targets.R
library(targets)

# preparation once:
# d <- ggplot2::mpg |> dplyr::mutate(id = 1:dplyr::n())
# write_csv(d |> dplyr::select(id, cty), "data1.csv")
# write_csv(d |> dplyr::select(id, displ, cyl), "data2.csv")

get_data <- function(file) {
  read_csv(file, col_types = cols()) |>
    as_tibble()
}

combine_data <- function(d1, d2) {
  left_join(d1, d2, by = "id")
}

fit_model <- function(data) {
  lm(cty ~ displ + cyl, data) |> 
    coefficients()
}

plot_model <- function(model, data) {
  ggplot(data) +
    geom_point(aes(x = displ, y = cty, color = cyl)) +
    geom_abline(intercept = model[1], slope = model[2]) +
    theme_gray(24)
}

tar_option_set(packages = c("tibble", "readr", "dplyr", "ggplot2"))

list(
  tar_target(file1, "data1.csv", format = "file"),
  tar_target(file2, "data2.csv", format = "file"),
  tar_target(data1, get_data(file1)),
  tar_target(data2, get_data(file2)),
  tar_target(data, combine_data(data1, data2)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)

# run: tar_make() -> runs all targets
# change a value in data2.csv -> tar_make() -> rebuilds only the affected downstream targets

The pipeline should then use a connection to a database like this instead:

# write to Database
con <- DBI::dbConnect(duckdb::duckdb(), "data.db")
on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
DBI::dbWriteTable(con, "data1", d |> dplyr::select(id, cty))
DBI::dbWriteTable(con, "data2", d |> dplyr::select(id, displ, cyl))

# in _targets.R... unclear at best...
list(
  tar_target(data1, ???) # if I use DBI::dbReadTable(con, "data1"), it duplicates the data and returns it to R
)

The problem is that if I return the values of the table back to R, this might take too long. Is there a database-aware adapter for targets, and is this even possible/intended with targets?
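For illustration, the closest copy-free approach I can see is a lazy dbplyr reference. But a tbl() carries a live database connection, which (as far as I can tell) {targets} cannot serialize between runs, and edits to the underlying table would not invalidate anything downstream. A sketch, reusing data.db from above:

library(DBI)
library(dplyr) # tbl() on a connection needs dbplyr installed to translate verbs to SQL

con <- dbConnect(duckdb::duckdb(), "data.db")
data1_ref <- tbl(con, "data1") # lazy reference, no rows pulled into R
data2_ref <- tbl(con, "data2")

# the join is translated to SQL and executed inside duckdb
combined <- left_join(data1_ref, data2_ref, by = "id")

# rows only reach R on collect()
head(collect(combined))

dbDisconnect(con, shutdown = TRUE)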

Upvotes: 8

Views: 549

Answers (1)

David Ranzolin

Reputation: 1084

I've experienced the same awkwardness and created sqltargets as a result. The pattern I now use is:

load_tables_to_db <- function(...) {
  tables <- tibble::lst(...) # lst() keeps the argument names, which become the table names
  path_to_db <- fs::path("path_to_db")

  con <- DBI::dbConnect(
    duckdb::duckdb(),
    dbdir = path_to_db,
    read_only = FALSE
  )
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE))
  purrr::walk2(
    tables, names(tables),
    \(df, nm) DBI::dbWriteTable(conn = con, name = nm, value = df, overwrite = TRUE)
  )
}

And in _targets.R:

tar_target(
  tables_loaded_to_db,
  load_tables_to_db(
    table1,
    table2,
    ...
  )
),
tar_sql(report1, "query1.sql"),
tar_sql(report2, "query2.sql")

tar_sql() creates an upstream target for the query file (e.g. 'report1_query_file'), while 'report1' contains the result of the query. The SQL file must contain a connection string in its header, and dependencies on other targets can be declared in comments (e.g. '-- tar_load(table1)').
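For example, query1.sql could look like this (a sketch: the '-- !preview conn=' header follows the RStudio SQL-script convention, and I'm assuming both tables share an id column as in the question; check the package README for the exact syntax):

-- !preview conn=DBI::dbConnect(duckdb::duckdb(), dbdir = "path_to_db")
-- tar_load(tables_loaded_to_db)

SELECT *
FROM table1
LEFT JOIN table2 USING (id)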

The package is brand new and still a work in progress, but it may help smooth your workflow.

Upvotes: 6
