Cameron
Cameron

Reputation: 367

How to join rows that match any of multiple columns

I'd like to join two dataframes that list different countries. However, the countries' names are not consistent. One df has a single column for each countries' name, and the other has multiple columns with different names for that country. e.g.

df1 <- 
  tibble(
    country = c("United Kingdom", "USA", "China", "India", "Russia"),
    value = c(1:5))

df2 <- 
  tibble(
    admin_full = c(
        "United Kingdom", 
        "United States of America", 
        "People's Republic of China",  
        "India", 
        "Russian Federation"),
    admin_short = c("UK", "United States", "China", "India", "Russia"),
    admin_abbreviated = c("UK", "USA", "PRC", "India", "Russia"),
    group = c("High", "High", "Medium", "Low", "Low"))

I would like to join the dfs where df1 matches any of the names in df2. So

I'm working with a list of ~200 countries so I don't want to have to go through an manually pick out which one of a dozen columns has the correct format to match that particular country on. I want to match where the name in df1 matches any one of multiple columns in df2.

Upvotes: 1

Views: 405

Answers (1)

Jon Spring
Jon Spring

Reputation: 66900

It sounds like any of the three admin_ columns could match, so my approach here is to make an extra column from the admin_full (a key for that country) and bring the three variations into different rows in a country column. Then we can join the two tables, and use distinct to keep only one match for each.

library(dplyr)
left_join(df1,
  df2 %>%
    mutate(full_name = admin_full) %>%
    pivot_longer(-c(full_name, group), values_to ="country")
  ) %>%
  distinct(country, value, group, full_name)

Result

# A tibble: 5 × 4
  country        value group  full_name                 
  <chr>          <int> <chr>  <chr>                     
1 United Kingdom     1 High   United Kingdom            
2 USA                2 High   United States of America  
3 China              3 Medium People's Republic of China
4 India              4 Low    India                     
5 Russia             5 Low    Russian Federation  

Or if you don't need a "canonical" name, you could shorten:

left_join(df1, pivot_longer(df2, -group, values_to ="country")) %>%
  distinct(country, value, group)

# A tibble: 5 × 3
  country        value group 
  <chr>          <int> <chr> 
1 United Kingdom     1 High  
2 USA                2 High  
3 China              3 Medium
4 India              4 Low   
5 Russia             5 Low  

Upvotes: 2

Related Questions