Reputation: 15738
Using the cars
dataset (with a postgres connection con_psql
):
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
We see that the following two sequences of operations, that only interchange the order of filter
and arrange
, lead to different SQL translations:
tmp_cars_sdf %>%
filter(speed == 4) %>%
arrange(dist) %>%
sql_render
# <SQL> SELECT *
# FROM "tmp_cars_sdf"
# WHERE ("speed" = 4.0)
# ORDER BY "dist"
vs
tmp_cars_sdf %>%
arrange(dist) %>%
filter(speed == 4) %>%
sql_render
# <SQL> SELECT *
# FROM (SELECT *
# FROM "tmp_cars_sdf"
# ORDER BY "dist") "dbplyr_006"
# WHERE ("speed" = 4.0)
I am not an expert in SQL but it seems like the ordering is only guaranteed in the former, and not the latter, when collecting results, or using the remote table in further operations:
Is order in a subquery guaranteed to be preserved? https://dba.stackexchange.com/questions/82930/database-implementations-of-order-by-in-a-subquery
What should I do about this?
Having done some further investigation I am not sure whether this is important from a data analysis point of view (i.e. treating remote tables just like data frames in ram).
tmp_cars_sdf %>%
arrange(dist) %>%
group_by(speed) %>%
filter(dist > 10) %>%
mutate(lag_dist = lag(dist)) %>%
sql_render
# <SQL> SELECT "speed", "dist", LAG("dist", 1, NULL) OVER
# (PARTITION BY "speed" ORDER BY "dist") AS "lag_dist"
# FROM (SELECT *
# FROM "tmp_cars_sdf"
# ORDER BY "dist") "dbplyr_014"
# WHERE ("dist" > 10.0)
While it seems that the arrange()
seems to be translated into ORDER BY
in a subquery, it turns out that this is not important, at least for calculating a new variable based on the order since the ordering information is not 'forgotten'.
Upvotes: 1
Views: 286
Reputation: 6941
You are correct, ordering is only preserved in your first example, not in your second. In fact, your second example may cause errors in some versions of SQL that do not accept ORDER BY
clauses in subqueries.
When you use dbplyr
the remote table is essentially defined by the SQL query that constructs the current state of the table. dbplyr
translation into SQL works incrementally/one command at a time, taking the existing query and augmenting it to reflect the next transformation. This often results in the previous query becoming a subquery of the new query.
In general, only when results are returned to R (such as by collect
) does the query evaluate. Some tricks can be used to force & save intermediate evaluations, but I have not found any that guarantee ordering.
Given the above, my recommendation is:
order by
clauses within functions instead of arrange
when order matters for creating new variables. For example:
my_data %>% mutate(new = lag(old, order_by = "date"))
my_data %>% arrange(date) %>% mutate(new = lag(old))
You may also find this helpful for row numbers/rank.
Upvotes: 1