Alex
Alex

Reputation: 15738

dplyr's arrange operation does not seem to be commutative within a remote source, when does it matter?

Using the cars dataset (with a postgres connection con_psql):

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

We see that the following two sequences of operations, that only interchange the order of filter and arrange, lead to different SQL translations:

tmp_cars_sdf %>%
    filter(speed == 4) %>%
    arrange(dist) %>%
    sql_render

# <SQL> SELECT *
#     FROM "tmp_cars_sdf"
# WHERE ("speed" = 4.0)
# ORDER BY "dist"

vs

tmp_cars_sdf %>%
    arrange(dist) %>%
    filter(speed == 4) %>%
    sql_render

# <SQL> SELECT *
#     FROM (SELECT *
#               FROM "tmp_cars_sdf"
#           ORDER BY "dist") "dbplyr_006"
# WHERE ("speed" = 4.0)

I am not an expert in SQL but it seems like the ordering is only guaranteed in the former, and not the latter, when collecting results, or using the remote table in further operations:

Is order in a subquery guaranteed to be preserved? https://dba.stackexchange.com/questions/82930/database-implementations-of-order-by-in-a-subquery

What should I do about this?

Edit:

Having done some further investigation I am not sure whether this is important from a data analysis point of view (i.e. treating remote tables just like data frames in ram).

tmp_cars_sdf %>%
    arrange(dist) %>%
    group_by(speed) %>%
    filter(dist > 10) %>%
    mutate(lag_dist = lag(dist)) %>%
    sql_render

# <SQL> SELECT "speed", "dist", LAG("dist", 1, NULL) OVER 
# (PARTITION BY "speed" ORDER BY "dist") AS "lag_dist"
# FROM (SELECT *
#           FROM "tmp_cars_sdf"
#       ORDER BY "dist") "dbplyr_014"
# WHERE ("dist" > 10.0)

While it seems that the arrange() seems to be translated into ORDER BY in a subquery, it turns out that this is not important, at least for calculating a new variable based on the order since the ordering information is not 'forgotten'.

Upvotes: 1

Views: 286

Answers (1)

Simon.S.A.
Simon.S.A.

Reputation: 6941

You are correct, ordering is only preserved in your first example, not in your second. In fact, your second example may cause errors in some versions of SQL that do not accept ORDER BY clauses in subqueries.

When you use dbplyr the remote table is essentially defined by the SQL query that constructs the current state of the table. dbplyr translation into SQL works incrementally/one command at a time, taking the existing query and augmenting it to reflect the next transformation. This often results in the previous query becoming a subquery of the new query.

In general, only when results are returned to R (such as by collect) does the query evaluate. Some tricks can be used to force & save intermediate evaluations, but I have not found any that guarantee ordering.

Given the above, my recommendation is:

  • Only use the arrange command for preparing output for viewing by a human as databases generally do not care about order.
  • Use the order by clauses within functions instead of arrange when order matters for creating new variables. For example:
    • Use my_data %>% mutate(new = lag(old, order_by = "date"))
    • Instead of my_data %>% arrange(date) %>% mutate(new = lag(old))

You may also find this helpful for row numbers/rank.

Upvotes: 1

Related Questions