bheavner
bheavner

Reputation: 559

how filter() a db-backed tibble using a variable?

I am attempting to filter a database-backed tibble by piping to a filter() command and am observing unexpected behavior:

If I filter using filter(pos == variable), I get the same result regardless of what value I assign to the variable. However, filtering with the value, e.g. filter(pos == 12345) works - the result changes as it should for each different value I'm filtering on.

Is this an aspect of lazy evaluation or tidyeval? What is the correct way to filter() a DB-backed tibble using a variable?

Here's a reproducible example:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")

ex_data <- tibble(
  pos = c(10510138, 10510507),
  ref = c("CATA", "TCA"),
  alt = c("C", "T")
)

copy_to(con, ex_data, "variants", temporary = FALSE)

toQueryDB <- tbl(con, "variants")

pos = 10510138
(result <- toQueryDB %>% filter(pos == pos)  %>% select(pos, ref, alt) %>% head(1))
# 10510138 CATA  C 

pos = 10510507
(result <- toQueryDB %>% filter(pos == pos)  %>% select(pos, ref, alt) %>% head(1))
# STILL 10510138 CATA  C !!!

(result <- toQueryDB %>% filter(pos == 10510138)  %>% select(pos, ref, alt) %>% head(1))
# 10510138 CATA  C

(result <- toQueryDB %>% filter(pos == 10510507)  %>% select(pos, ref, alt) %>% head(1))
# 10510507 TCA   T 

DBI::dbDisconnect(con)

And my session info:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.7.6

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     rstudioapi_0.7   bindr_0.1.1      magrittr_1.5     tidyselect_0.2.4 bit_1.1-14       R6_2.2.2        
 [8] rlang_0.2.2      fansi_0.3.0      blob_1.1.1       tools_3.5.1      utf8_1.1.4       cli_1.0.1        DBI_1.0.0       
[15] dbplyr_1.2.2     yaml_2.2.0       bit64_0.9-7      assertthat_0.2.0 digest_0.6.17    tibble_1.4.2     crayon_1.3.4    
[22] bindrcpp_0.2.2   purrr_0.2.5      memoise_1.1.0    glue_1.3.0       RSQLite_2.1.1    compiler_3.5.1   pillar_1.3.0    
[29] pkgconfig_2.0.2 

Upvotes: 3

Views: 512

Answers (2)

MS Berends
MS Berends

Reputation: 5269

To elaborate on MrFlick's answer, you can also use !! followed by parentheses for an in-place calculation instead of just evaluating an object, such as:

con %>%
  tbl("some_table") %>%
  filter(date > !!(Sys.Date() - 1000))

Longer, reproducible answer:

library(dplyr)
# we'll use duckdb here, of course
library(duckdb)

# connect, will be in-memory
db <- dbConnect(duckdb())

# create data set with 10k dates between 25k days ago and today
df <- data.frame(date = Sys.Date() + runif(10000, min = -25000, max = 0),
                 value = runif(10000))

# write to the duckdb
dbWriteTable(db, "test", df)

# I want all values from last 1000 days, though this won't work:
db %>%
  tbl("test") %>%
  filter(date > Sys.Date() - 1000)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> ! rapi_prepare: Failed to prepare query SELECT *
#> FROM test
#> WHERE (date > (Sys.Date() - 1000.0))
#> LIMIT 11
#> Error: Catalog Error: Scalar Function with name date does not exist!
#> Did you mean "main.day"?
#> LINE 3: WHERE (date > (Sys.Date() - 1000.0))
#>                        ^

# now let the magic start, use !! with parentheses:
db %>%
  tbl("test") %>%
  filter(date > !!(Sys.Date() - 1000))
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB 0.8.1 [root@Darwin 22.6.0:R 4.3.1/:memory:]
#>    date        value
#>    <date>      <dbl>
#>  1 2022-05-28 0.809 
#>  2 2021-10-11 0.267 
#>  3 2023-01-09 0.657 
#>  4 2021-12-05 0.0280
#>  5 2020-11-10 0.0411
#>  6 2021-12-11 0.0912
#>  7 2021-04-05 0.726 
#>  8 2021-02-10 0.315 
#>  9 2020-12-18 0.877 
#> 10 2022-11-01 0.230 
#> # ℹ more rows
#> # ℹ Use `print(n = ...)` to see more rows

Tada!

Upvotes: 0

MrFlick
MrFlick

Reputation: 206586

When you have

filter(pos == pos)

dplyr doesn't try to figure out which pos is which and it assumes they both are coming from the data that's piped into the function. If you want to inject the value of a variable from outside the data being passed in, then you need to use the bang-bang rlang operator (!!). You should use

filter(pos == !!pos)

Upvotes: 6

Related Questions