Reputation: 333
Good afternoon!
I'm currently working on a data manipulation task using R and faced with a dilemma.
Two tables are around, and my goal is to join these tables using specific keys.
Table1:
Name <- c("John", "Michael", "Anna", "Boris")
ID <- c("ID1", "ID2", "ID3", "ID4")
PDN <- c(40, 10, 6, 70)
Sum3107 <- c(16, 10, 53, 44)
Sum3108 <- c(16, 8, 50, 43)
table1 <- data.frame(Name, ID, PDN, Sum3107, Sum3108)
And Table2:
Name <- c("Martin", "Anna", "Olga", "Boris")
ID <- c("ID6", "ID3", "ID7", "ID4")
PDN <- c(22, 6, 44, 70)
Sum3009 <- c(10, 8, 45, 30)
Sum3110 <- c(9, 6, 30, 20)
table2 <- data.frame(Name, ID, PDN, Sum3009, Sum3110)
I've opted for a full_join operator as it perfectly solves the task in theory:
table3 <- full_join(table1, table2, by = c("Name", "ID", "PDN"))
Everything is correct because all the repeated columns in these two tables are selected as keys.
But if I need to select as keys only specific column names, and opt for a full_join, R will duplicate columns that are repeated in to tables, which is not the I expect.
table3 <- full_join(table1, table2, by = c("Name", "ID")) #"PDN" was removed
Is it possible to run a join on specific columns rather than all repeated in two tables without getting duplicated results?
Expected result: I want to get a full join from two tables using only two keys (c("Name", "ID")), where "PDN" column is shown but not duplicated in a result section (PDN.x and PDN.y are not around).
Thank you in advance! Any help is highly appreciated!
Upvotes: 0
Views: 932
Reputation: 66570
Does this help? Same output as full join in different order. I'm not specifying PDN, but I am specifying the columns I want to sum, which excludes PDN.
bind_rows(table1, table2) %>%
group_by(Name, ID) %>%
summarise(across(contains("Sum"), ~sum(.x, na.rm = T)), .groups = "drop")
I can't yet think of a way to make R treat the PDN column differently from the Sum columns without giving it some indication that it should be treated like a key and/or the others should be treated like values.
Edit - This isn't elegant, but another approach you could take would be to do your desired join, and then "fix it in post." Here done by reshaping long, removing any ".x" or ".y" from column name, filtering for first non-NA, then pivoting wide again.
But this is definitely worse. :-)
full_join(table1, table2, by = c("Name", "ID")) %>%
pivot_longer(-c(Name, ID)) %>%
mutate(name = name %>% str_remove(".x|.y")) %>%
filter(!is.na(value)) %>%
group_by(Name, ID, name) %>% slice(1) %>% ungroup() %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 6 x 7
Name ID PDN Sum3009 Sum3107 Sum3108 Sum3110
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Anna ID3 6 8 53 50 6
2 Boris ID4 70 30 44 43 20
3 John ID1 40 NA 16 16 NA
4 Martin ID6 22 10 NA NA 9
5 Michael ID2 10 NA 10 8 NA
6 Olga ID7 44 45 NA NA 30
Upvotes: 1