Reputation: 3061
If I run the following trivial example, I get the expected output:
library(dplyr)
library(dtplyr)
library(data.table)
dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))
dt1 %>%
left_join(
dt2,
by = "b"
) %>%
as.data.table()
> b a.x a.y
> 1: 6 1 a
> 2: 7 2 b
> 3: 8 3 c
> 4: 9 4 d
> 5: 10 5 e
Note that the conflicting columns a
are properly managed, using the standard dplyr
format of adding .x
and .y
suffixes.
However, if I now try to drop one of the columns:
dt1 %>%
left_join(
dt2,
by = "b"
) %>%
select(
-a.y
) %>%
as.data.table()
> Error in is_character(x) : object 'a.y' not found
Interestingly, if I try to select one of the a
columns (select(a.x)
), I get the same error, but... if I instead try select(a)
(selecting a column which shouldn't really exist anymore), I get the following output:
dt1 %>%
left_join(
dt2,
by = "b"
) %>%
select(
a
) %>%
as.data.table()
> a.b
> 1: 1
> 2: 2
> 3: 3
> 4: 4
> 5: 5
where the selected column is clearly dt1$a
, but for some reason the given column name is a.b
. (if I try select(a.b)
, I get the same object not found
error).
Meanwhile, if I try to drop a
, both a
columns are dropped:
dt1 %>%
left_join(
dt2,
by = "b"
) %>%
select(
-a
) %>%
as.data.table()
> b
> 1: 6
> 2: 7
> 3: 8
> 4: 9
> 5: 10
So, how can I use select
with joins where the tables have conflicting (and not joined-by) columns?
EDIT:
As mentioned in some answers, I can obviously execute the lazy evaluation before the select, which works. However, it throws a warning (since I'd like to keep my object as a data.table
, not a data.frame
) so it doesn't seem to be the intended method:
dt1 %>%
left_join(
dt2,
by = "b"
) %>%
as.data.table() %>%
select(
-a.x
)
> b a.y
> 1: 6 a
> 2: 7 b
> 3: 8 c
> 4: 9 d
> 5: 10 e
> Warning message:
> You are using a dplyr method on a raw data.table, which will call the data
> frame implementation, and is likely to be inefficient.
> *
> * To suppress this message, either generate a data.table translation with
> * `lazy_dt()` or convert to a data frame or tibble with
> * `as.data.frame()`/`as_tibble()`.
Upvotes: 2
Views: 278
Reputation: 3061
This is a bug in the current release of dtplyr
(1.0.0), but has now been fixed in the development version.
Upvotes: 0
Reputation: 1831
The key piece is that dtplyr
uses lazy evaluation. More here https://dtplyr.tidyverse.org/, but the key piece is:
Compared to the previous release, this version of dtplyr is a complete rewrite that focusses only on lazy evaluation triggered by use of lazy_dt(). This means that no computation is performed until you explicitly request it with as.data.table(), as.data.frame() or as_tibble().
In your example, this means that the join hasn't been evaluated before the select, they are being translated and waiting to be evaluated. The translation mechanism allows dtplyr to combine multiple verbs more efficiently into one action. (More on translation here: https://dtplyr.tidyverse.org/articles/translation.html)
There are a few ways to approach the problem. The simplest is to move up the evaluation using as.data.frame()
.
dt1 %>%
left_join(
dt2,
by = "b"
) %>%
as.data.frame() %>%
select(-a.y)
Another way is to get the data.table object, and then subset the columns using data.table
syntax.
Upvotes: 3
Reputation: 8555
Apparently, left_join
does not work with data.table
but works with data.frame
(I didn't know it before).
Therefore, one solution could be to do:
library(dplyr)
library(dtplyr)
library(data.table)
dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))
as.data.frame(dt1) %>%
left_join(as.data.frame(dt2), by = "b") %>%
select(-a.y) %>%
as.data.table()
Upvotes: 1