Dealing with duplicated columns using full_join in R

Question

Good afternoon!

I'm currently working on a data manipulation task using R and faced with a dilemma.

Two tables are around, and my goal is to join these tables using specific keys.

Table1:

Name <- c("John", "Michael", "Anna", "Boris")
ID <- c("ID1", "ID2", "ID3", "ID4")
PDN <- c(40, 10, 6, 70)
Sum3107 <- c(16, 10, 53, 44)
Sum3108 <- c(16, 8, 50, 43)

table1 <- data.frame(Name, ID, PDN, Sum3107, Sum3108)

And Table2:

Name <- c("Martin", "Anna", "Olga", "Boris")
ID <- c("ID6", "ID3", "ID7", "ID4")
PDN <- c(22, 6, 44, 70)
Sum3009 <- c(10, 8, 45, 30)
Sum3110 <- c(9, 6, 30, 20)

table2 <- data.frame(Name, ID, PDN, Sum3009, Sum3110)

I've opted for a full_join operator as it perfectly solves the task in theory:

table3 <- full_join(table1, table2, by = c("Name", "ID", "PDN"))

Everything is correct because all the repeated columns in these two tables are selected as keys.

But if I need to select as keys only specific column names, and opt for a full_join, R will duplicate columns that are repeated in to tables, which is not the I expect.

table3 <- full_join(table1, table2, by = c("Name", "ID")) #"PDN" was removed

Is it possible to run a join on specific columns rather than all repeated in two tables without getting duplicated results?

Expected result: I want to get a full join from two tables using only two keys (c("Name", "ID")), where "PDN" column is shown but not duplicated in a result section (PDN.x and PDN.y are not around).

Thank you in advance! Any help is highly appreciated!

Jon Spring · Accepted Answer

Does this help? Same output as full join in different order. I'm not specifying PDN, but I am specifying the columns I want to sum, which excludes PDN.

bind_rows(table1, table2) %>%
  group_by(Name, ID) %>%
  summarise(across(contains("Sum"), ~sum(.x, na.rm = T)), .groups = "drop")

I can't yet think of a way to make R treat the PDN column differently from the Sum columns without giving it some indication that it should be treated like a key and/or the others should be treated like values.

Edit - This isn't elegant, but another approach you could take would be to do your desired join, and then "fix it in post." Here done by reshaping long, removing any ".x" or ".y" from column name, filtering for first non-NA, then pivoting wide again.

But this is definitely worse. :-)

full_join(table1, table2, by = c("Name", "ID")) %>%
  pivot_longer(-c(Name, ID)) %>%
  mutate(name = name %>% str_remove(".x|.y")) %>%
  filter(!is.na(value)) %>%
  group_by(Name, ID, name) %>% slice(1) %>% ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

# A tibble: 6 x 7
  Name    ID      PDN Sum3009 Sum3107 Sum3108 Sum3110
                  
1 Anna    ID3       6       8      53      50       6
2 Boris   ID4      70      30      44      43      20
3 John    ID1      40      NA      16      16      NA
4 Martin  ID6      22      10      NA      NA       9
5 Michael ID2      10      NA      10       8      NA
6 Olga    ID7      44      45      NA      NA      30

Dealing with duplicated columns using full_join in R

Answers (1)

Related Questions