Wasabi
Wasabi

Reputation: 3061

Select after a join with conflicting columns with dtplyr

If I run the following trivial example, I get the expected output:

library(dplyr)
library(dtplyr)
library(data.table)

dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.table()
>     b a.x a.y
> 1:  6   1   a
> 2:  7   2   b
> 3:  8   3   c
> 4:  9   4   d
> 5: 10   5   e

Note that the conflicting columns a are properly managed, using the standard dplyr format of adding .x and .y suffixes.

However, if I now try to drop one of the columns:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    -a.y
  ) %>%
  as.data.table()
> Error in is_character(x) : object 'a.y' not found

Interestingly, if I try to select one of the a columns (select(a.x)), I get the same error, but... if I instead try select(a) (selecting a column which shouldn't really exist anymore), I get the following output:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    a
  ) %>%
  as.data.table()
>    a.b
> 1:   1
> 2:   2
> 3:   3
> 4:   4
> 5:   5

where the selected column is clearly dt1$a, but for some reason the given column name is a.b. (if I try select(a.b), I get the same object not found error).

Meanwhile, if I try to drop a, both a columns are dropped:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    -a
  ) %>%
  as.data.table()
>     b
> 1:  6
> 2:  7
> 3:  8
> 4:  9
> 5: 10

So, how can I use select with joins where the tables have conflicting (and not joined-by) columns?

EDIT:

As mentioned in some answers, I can obviously execute the lazy evaluation before the select, which works. However, it throws a warning (since I'd like to keep my object as a data.table, not a data.frame) so it doesn't seem to be the intended method:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.table() %>%
  select(
    -a.x
  )
>     b a.y
> 1:  6   a
> 2:  7   b
> 3:  8   c
> 4:  9   d
> 5: 10   e
> Warning message:
> You are using a dplyr method on a raw data.table, which will call the data 
> frame implementation, and is likely to be inefficient.
> * 
> * To suppress this message, either generate a data.table translation with
> * `lazy_dt()` or convert to a data frame or tibble with
> * `as.data.frame()`/`as_tibble()`.

Upvotes: 2

Views: 278

Answers (3)

Wasabi
Wasabi

Reputation: 3061

This is a bug in the current release of dtplyr (1.0.0), but has now been fixed in the development version.

Upvotes: 0

ravic_
ravic_

Reputation: 1831

The key piece is that dtplyr uses lazy evaluation. More here https://dtplyr.tidyverse.org/, but the key piece is:

Compared to the previous release, this version of dtplyr is a complete rewrite that focusses only on lazy evaluation triggered by use of lazy_dt(). This means that no computation is performed until you explicitly request it with as.data.table(), as.data.frame() or as_tibble().

In your example, this means that the join hasn't been evaluated before the select, they are being translated and waiting to be evaluated. The translation mechanism allows dtplyr to combine multiple verbs more efficiently into one action. (More on translation here: https://dtplyr.tidyverse.org/articles/translation.html)

There are a few ways to approach the problem. The simplest is to move up the evaluation using as.data.frame().

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.frame() %>%
  select(-a.y)

Another way is to get the data.table object, and then subset the columns using data.table syntax.

Upvotes: 3

bretauv
bretauv

Reputation: 8555

Apparently, left_join does not work with data.table but works with data.frame (I didn't know it before).

Therefore, one solution could be to do:

library(dplyr)
library(dtplyr)
library(data.table)

dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))

as.data.frame(dt1) %>%
  left_join(as.data.frame(dt2), by = "b") %>%
  select(-a.y) %>%
  as.data.table()

Upvotes: 1

Related Questions