Reputation: 4180
I have two tables that I want to do a full join using dplyr, but I don't want it to drop any of the columns. Per the documentation and my own experience it is only keeping the join column for the left hand side. This is a problem when you have a row with a record for the right hand side since the join value is gone.
For example, suppose I have the two tables a and b,
customerId | revenue customerId | state
-----------|--------- -----------|-------
1 | 2000 1 | CA
2 | 3000 3 | GA
4 | 4000 4 | NY
doing something like full_join(a, b, by="customerId")
will produce
customerId | revenue | state
-----------|---------|-------
1 | 2000 | CA
2 | 3000 | <NA>
<NA> | <NA> | GA
4 | 4000 | NY
so there is no way to tell which customer that third row is from. The ideal output would be
customerId.a | customerId.b | revenue | state
-------------|--------------|---------|-------
1 | 1 | 2000 | CA
2 | <NA> | 3000 | <NA>
<NA> | 3 | <NA> | GA
4 | 4 | 4000 | NY
note that this is just a toy example. I'm actually using sparklyr so this is all being run in Spark. Thus, merge won't work here for me. Is there a way to do what I'm looking for in dplyr?
EDIT: As someone pointed out this actually is working as desired in dplyr itself locally. However, I do see this problem using sparklyr (which uses dplyr). Here is the code to see that:
library(sparklyr)
sc <- spark_connect("local[4]")
d1 <- data_frame(customerId = c("1","2","4"), revenue=c(2000,3000,4000))
d2 <- data_frame(customerId = c("1","3","4"), state=c("CA", "GA", "NY"))
d1_tbl <- copy_to(sc, d1)
d2_tbl <- copy_to(sc, d2)
full_join(d1_tbl, d2_tbl, by=c("customerId"))
Upvotes: 6
Views: 17300
Reputation: 349
Just use the parameter keep
in dplyr::full_join()
.
full_join(a, b, by="customerId",keep = TRUE)
Upvotes: 1
Reputation: 4762
This has been fixed
> full_join(d1_tbl, d2_tbl, by="customerId")
# Source: lazy query [?? x 3]
# Database: spark_connection
customerId revenue state
<chr> <dbl> <chr>
1 1 2000 CA
2 3 NaN GA
3 2 3000 <NA>
4 4 4000 NY
Upvotes: 2
Reputation: 121077
I can't reproduce your problem. All the IDs should be (and are) included in the full join.
library(data_frame)
d1 <- data_frame(
customerId = c(1, 2, 4),
revenue = c(2000, 3000, 4000)
)
d2 <- data_frame(
customerId = c(1, 3, 4),
state = c("CA", "GA", "NY")
)
full_join(d1, d2, by = "customerId")
## # A tibble: 4 × 3
## customerId revenue state
## <dbl> <dbl> <chr>
## 1 1 2000 CA
## 2 2 3000 <NA>
## 3 4 4000 NY
## 4 3 NA GA
Update: I can reproduce the problem using sparklyr
. It's weird behavior, so you might want to file an issue. (Unclear whether the problem is with sparklyr
or dplyr
or DBI
or Spark SQL though.)
Using explain()
, you can see the SQL that was generated.
full_join(d1_tbl, d2_tbl, by=c("customerId")) %>% explain()
You can try running a custom SQL query to get what you want, though it's a little messier.
library(DBI)
qry <- "SELECT
d1.customerID AS customerID1,
d2.customerID AS customerID2,
d1.revenue,
d2.state
FROM d1
FULL JOIN d2
ON d1.customerId = d2.customerId"
dbGetQuery(sc, qry)
## customerID1 customerID2 revenue state
## 1 1 1 2000 CA
## 2 2 <NA> 3000 <NA>
## 3 <NA> 3 NaN GA
## 4 4 4 4000 NY
Upvotes: 5
Reputation: 214987
You can create separate identical customerId
for both data frames before join:
full_join(
mutate(a, customerId.a = customerId),
mutate(b, customerId.b = customerId),
by="customerId"
) %>% select(-customerId)
# revenue customerId.a state customerId.b
#1 2000 1 CA 1
#2 3000 2 <NA> NA
#3 4000 4 NY 4
#4 NA NA GA 3
Upvotes: 10