Luise
Luise

Reputation: 654

Can dplyr automatically generate a column with merge results after join operations?

This might seem a somewhat basic question, but after reviewing documentation and searching the Internet I couldn't find an answer to this.

My question is: does R's dplyr include an argument to automatically generate a column with merge results after joining two dataframes? With join operations I mean left_join, right_join, or full_join.

I'm thinking of something similar to the _merge column you can generate on Python's pandas with the argument indicator = True when using merge, or to Stata's _merge variable that is generated by default using the merge command.

For example:

df1 <- data.frame(id = c(1,2), value1 =  c(1,1))
df2 <- data.frame(id = c(1,3), value2 =  c(2,1))
result <- df1 %>% full_join(df2, by = 'id', merge_results_option = TRUE)
# Notice that merge_results_option is the argument I'm looking for with this question, if it exists

Should yield:

|--------------|---------------|--------------|--------------|
|      id      |     value1    |    value2    |    _merge    |
|--------------|---------------|--------------|--------------|
|       1      |       1       |       2      |   both       |
|       2      |       1       |       NA     |   left       |
|       3      |       NA      |       1      |   right      |
|--------------|---------------|--------------|--------------|

Upvotes: 0

Views: 384

Answers (2)

Zo&#235; Turner
Zo&#235; Turner

Reputation: 509

As already said, I don't think this is a functionality of dplyr. Interestingly it's not something that is used in SQL either where joins are very common. However, the column can be recreated using mutate:

df1 <- data.frame(id = c(1,2), value1 =  c(1,1))
df2 <- data.frame(id = c(1,3), value2 =  c(2,1))


result <- df1 %>% 
  left_join(df2, by = 'id') %>% 
  mutate(merge = case_when(!is.na(value1) & !is.na(value2) ~ "both",
                       !is.na(value1) & is.na(value2) ~ "left",
                       is.na(value1) & !is.na(value2) ~ "right",
                       TRUE ~ "missing"
                       ))

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 146239

This is implemented, more-or-less, by the tidylog package. It prints a message summarizing the number of matches, but does add a column identifying each row.

You can read the discussion on the dplyr Feature request here, which lead to the development of tidylog, which ended with

Given that tidylog has solved this problem so wonderfully, I no longer think this needs to be in dplyr

though some users were still dissatisfied with lack of the column-add functionality. Surprisingly, I didn't see that feature request in the tidylog issues, so maybe you could add it there.

Upvotes: 1

Related Questions