Reputation: 19960
I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key of that same table, which I wish to filter on. Normally I would do something like this to filter the table:
library(ROracle)
library(dplyr)
library(dbplyr)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient, since the values will essentially be put into an IN statement (it simply hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
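(For reference, with write access I'd do the dbplyr equivalent of that, roughly copy_to() into a temporary table followed by a semi_join(), which dbplyr translates into that kind of subquery; values_tmp is just an illustrative name.)
library(dplyr)
# Requires privileges to create a (temporary) table, which I don't have
values_db <- copy_to(con, my_values, name = 'values_tmp', temporary = TRUE)
# Keeps the rows of MY_TABLE whose FIELD appears in the staged values
con %>% tbl('MY_TABLE') %>% semi_join(values_db, by = 'FIELD')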
How can I make this query more efficient in R?
Upvotes: 5
Views: 474
Reputation: 675
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bind_rows() at the end (see the sketch below). This way you'll have nicely piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
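A rough sketch of what I mean, assuming the con and my_values objects from the question (table and column names are taken from there):
library(dplyr)
library(purrr)
# One query per value; collect() pulls each result into R before combining.
# Note this issues one round trip per value, so it will be slow for 500K entries.
results <- my_values$FIELD %>%
  map(~ con %>%
        tbl('MY_TABLE') %>%
        filter(FIELD == !!.x) %>%   # !! forces the local value into the translated SQL
        collect()) %>%
  bind_rows()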
Upvotes: 0
Reputation: 2012
Not sure whether this will help, but a few suggestions:
- If my_values$FIELD is consecutive, or the list of values can be inferred from some other column, you can get help from the between filter: filter(between(FIELD, a, b))
- Split my_values into small batches, make a query for each batch, then combine the results (see the sketch below). This may take a while, but should be stable and worth the wait.
Upvotes: 1
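A minimal sketch of the batching idea, assuming the con and my_values objects from the question (batch_size is just an illustrative choice):
library(dplyr)
library(purrr)
# Split the values into chunks; Oracle caps an IN list at 1,000 literals,
# so 1000 is a natural batch size
batch_size <- 1000
batches <- split(my_values$FIELD,
                 ceiling(seq_along(my_values$FIELD) / batch_size))
# One query per batch, collected into R and combined at the end
result <- batches %>%
  map(~ con %>%
        tbl('MY_TABLE') %>%
        filter(FIELD %in% !!.x) %>%
        collect()) %>%
  bind_rows()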