Reputation: 1111
I am sending a simple query to a MySQL database using dbplyr
:
library(dbplyr)
my_conn<-dbConnect(...)
tab<-tbl(my_conn, "myTable")
tab %>% select(id, date, type) %>%
filter(type = 'foobar')
However, if I inspect the generated SQL with show_query()
, I get this:
SELECT *
FROM (SELECT `id`, `date`, `type`
FROM `myTable`) `q01`
WHERE (`type` == 'foobar')
This query is very slow to execute.
If I instead execute the following SQL command on the server:
SELECT id, date, type FROM db.myTable WHERE type = 'foobar'
then the return is nearly instantaneous.
My question is: why is dbplyr
doing SELECT *
(i.e. select all) and then doing a nested select in line 2? Also, why has "q01" appeared in this query? Could this be why my query is very slow to run compared to executing the minimal command directly on the server? I would perhaps expect dbplyr
to create inefficient SQL queries as the complexity of the operations increases but--in this case at least--I can't really write a more succinct set of operations. It's just a select and filter (SELECT and WHERE).
Upvotes: 1
Views: 483
Reputation: 6941
dbplyr is generating the SQL query as I would expect. What it has done is one query inside another:
SELECT id, date, type FROM myTable
Is a subquery in the super query
SELECT *
FROM (
subquery
) q01
WHERE type = foobar
The q01
is the name given to the subquery. In the same way as the AS
keyword. For example: FROM very_long_table_name AS VLTN
.
Yes, this nesting is ugly. But many SQL engines have a query optimizer that calculates the best way to execute a query. On SQL Server, I have noticed little difference in performance because the query optimizer finds a faster way to execute than as written.
However, it appears that for MySQL, nested queries are known to result in slower performance. See here, here, and here.
One thing that might solve this is changing the order of the select
and filter
commands in R:
tab %>%
filter(type = 'foobar') %>%
select(id, date, type)
Will probably produce the translated query:
SELECT `id`, `date`, `type`
FROM `myTable`
WHERE (`type` == 'foobar')
Which will perform better.
Upvotes: 4
Reputation: 2414
I think you want to string it all together. Once you do tab <- tbl(my_conn, "myTable")
, you've downloaded the whole table.
tbl(my_conn, "myTable") %>%
select(id, date, type) %>%
filter(type = 'foobar')
Upvotes: -1