cdeterman

Reputation: 19960

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key of that same table, which I wish to filter on. Normally I would do something like this to filter the table:

library(ROracle)
library(dplyr)
library(dbplyr)

# connect info omitted
con <- dbConnect(...)

# df with values - my_values

con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)

However, the my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient, since all of those values are essentially inlined into an IN statement (the query effectively hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
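For reference, show_query() on the pipeline above confirms this; every value in my_values$FIELD ends up inlined into one enormous IN (...) list:

library(dplyr)
library(dbplyr)

# Inspect the SQL that dbplyr generates for the filter above
tbl(con, 'MY_TABLE') %>%
  filter(FIELD %in% my_values$FIELD) %>%
  show_query()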

How can I make this query more efficient in R?

Upvotes: 5

Views: 474

Answers (2)

edgararuiz

Reputation: 675

Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map() and then combine the results with dplyr::bind_rows() at the end. This way you'll have nicely piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
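For illustration, a rough sketch of what I mean, assuming con and my_values are set up as in the question (one small query per value, collected and stacked at the end):

library(dplyr)
library(dbplyr)
library(purrr)

# One query per value; each result is collected, then everything is stacked
result <- my_values$FIELD %>%
  map(~ tbl(con, 'MY_TABLE') %>%
        filter(FIELD == .x) %>%
        collect()) %>%
  bind_rows()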

Upvotes: 0

Polor Beer

Reputation: 2012

Not sure whether this will help, but a few suggestions:

  1. Find other criteria for filtering. For example, if my_values$FIELD is a consecutive range, or the list of values can be inferred from some other columns, you can use a range filter instead: filter(between(FIELD, a, b)).
  2. Divide and conquer. Split my_values into small batches, make a query for each batch, then combine the results (see the sketch after this list). This may take a while, but it should be stable and worth the wait.
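
A minimal sketch of option 2, assuming con and my_values are set up as in the question and an arbitrary batch size of 1,000:

library(dplyr)
library(dbplyr)
library(purrr)

# Split the 500K+ key values into batches of roughly 1,000 each
batches <- split(my_values$FIELD,
                 ceiling(seq_along(my_values$FIELD) / 1000))

# One query per batch (each becomes a manageable IN clause), then stack the results
result <- batches %>%
  map(~ tbl(con, 'MY_TABLE') %>%
        filter(FIELD %in% .x) %>%
        collect()) %>%
  bind_rows()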

Upvotes: 1
