Reputation: 4180

Is it possible to do a full join in dplyr and keep all the columns used in the join?

I have two tables that I want to do a full join using dplyr, but I don't want it to drop any of the columns. Per the documentation and my own experience it is only keeping the join column for the left hand side. This is a problem when you have a row with a record for the right hand side since the join value is gone.

For example, suppose I have the two tables a and b,

customerId | revenue               customerId | state
-----------|---------              -----------|-------
    1      | 2000                       1     |  CA
    2      | 3000                       3     |  GA
    4      | 4000                       4     |  NY

doing something like full_join(a, b, by="customerId") will produce

customerId | revenue | state
-----------|---------|-------
    1      |   2000  |  CA
    2      |   3000  | <NA>
   <NA>    |   <NA>  |  GA
    4      |   4000  |  NY

so there is no way to tell which customer that third row is from. The ideal output would be

customerId.a | customerId.b | revenue | state
-------------|--------------|---------|-------
      1      |      1       |   2000  |  CA
      2      |     <NA>     |   3000  | <NA>
    <NA>     |      3       |   <NA>  |  GA
      4      |      4       |   4000  |  NY

note that this is just a toy example. I'm actually using sparklyr so this is all being run in Spark. Thus, merge won't work here for me. Is there a way to do what I'm looking for in dplyr?

EDIT: As someone pointed out this actually is working as desired in dplyr itself locally. However, I do see this problem using sparklyr (which uses dplyr). Here is the code to see that:

library(sparklyr)
sc <- spark_connect("local[4]")
d1 <- data_frame(customerId = c("1","2","4"), revenue=c(2000,3000,4000))
d2 <- data_frame(customerId = c("1","3","4"), state=c("CA", "GA", "NY"))
d1_tbl <- copy_to(sc, d1)
d2_tbl <- copy_to(sc, d2)
full_join(d1_tbl, d2_tbl, by=c("customerId"))

Upvotes: 6

Answers (4)

xaviescacs

Reputation: 349

Just use the parameter keep in dplyr::full_join().

full_join(a, b, by="customerId",keep = TRUE)

Upvotes: 1

kevinykuo

Reputation: 4762

This has been fixed

> full_join(d1_tbl, d2_tbl, by="customerId")
# Source:   lazy query [?? x 3]
# Database: spark_connection
  customerId revenue state
       <chr>   <dbl> <chr>
1          1    2000    CA
2          3     NaN    GA
3          2    3000  <NA>
4          4    4000    NY

Upvotes: 2

Richie Cotton

Reputation: 121077

I can't reproduce your problem. All the IDs should be (and are) included in the full join.

library(data_frame)
d1 <- data_frame(
  customerId = c(1, 2, 4),
  revenue = c(2000, 3000, 4000)
)
d2 <- data_frame(
  customerId = c(1, 3, 4),
  state = c("CA", "GA", "NY")
)

full_join(d1, d2, by = "customerId")
## # A tibble: 4 × 3
##   customerId revenue state
##        <dbl>   <dbl> <chr>
## 1          1    2000    CA
## 2          2    3000  <NA>
## 3          4    4000    NY
## 4          3      NA    GA

Update: I can reproduce the problem using sparklyr. It's weird behavior, so you might want to file an issue. (Unclear whether the problem is with sparklyr or dplyr or DBI or Spark SQL though.)

Using explain(), you can see the SQL that was generated.

full_join(d1_tbl, d2_tbl, by=c("customerId")) %>% explain()

You can try running a custom SQL query to get what you want, though it's a little messier.

library(DBI)
qry <- "SELECT 
    d1.customerID AS customerID1, 
    d2.customerID AS customerID2, 
    d1.revenue, 
    d2.state 
  FROM d1 
  FULL JOIN d2 
    ON d1.customerId = d2.customerId"
dbGetQuery(sc, qry)  
##   customerID1 customerID2 revenue state
## 1           1           1    2000    CA
## 2           2        <NA>    3000  <NA>
## 3        <NA>           3     NaN    GA
## 4           4           4    4000    NY

Upvotes: 5

akuiper

Reputation: 214987

You can create separate identical customerId for both data frames before join:

full_join(
    mutate(a, customerId.a = customerId), 
    mutate(b, customerId.b = customerId), 
    by="customerId"
) %>% select(-customerId)

#  revenue customerId.a state customerId.b
#1    2000            1    CA            1
#2    3000            2  <NA>           NA
#3    4000            4    NY            4
#4      NA           NA    GA            3

Upvotes: 10

Is it possible to do a full join in dplyr and keep all the columns used in the join?

Answers (4)

Related Questions