Reputation: 61
Let's say I have a data frame df1 with the following variables:
Continent Country
1 Europe Russia
2 Asia Myanmar (Burma)
3 africa Benin
4 africa Botswana
5 africa Burkina
and df2 with the following variables
Continent Country
1 Europe Russian Federation
2 Asia Myanmar
3 africa Benin,new
4 africa Botswana
5 africa Burkina
How do I combine the two data frames by Country using partial matching?
Upvotes: 0
Views: 284
Reputation: 30474
It might be helpful to know what your final/desired data frame would look like.
You could consider the fuzzyjoin package for merging these two data frames. One approach would be to use str_detect and see if one Country string is contained in the other.
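library(tidyverse)
library(fuzzyjoin)
# TRUE when either Country string contains the other (str_detect treats the pattern as a regular expression)
mf <- function(a, b) str_detect(a, b) | str_detect(b, a)
# Keep the rows of df1 that have at least one match in df2
fuzzy_semi_join(df1, df2, by = "Country", match_fun = mf)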
Continent Country
1 Europe Russia
2 Asia Myanmar (Burma)
3 Africa Benin
4 Africa Botswana
5 Africa Burkina
An inner join will show you how the rows are matched (keeping both Country columns for comparison):
fuzzy_inner_join(df1, df2, by = "Country", match_fun = mf)
Continent.x Country.x Continent.y Country.y
1 Europe Russia Europe Russian Federation
2 Asia Myanmar (Burma) Asia Myanmar
3 Africa Benin Africa Benin,new
4 Africa Botswana Africa Botswana
5 Africa Burkina Africa Burkina
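Because str_detect interprets its pattern as a regular expression, a name such as "Myanmar (Burma)" is read with the parentheses as a group rather than as literal characters. The following is a minimal base R sketch of the same "either string contains the other" idea using literal matching. It is not the fuzzyjoin approach itself: it swaps in a plain cross join plus grepl(fixed = TRUE), and the helper name contains_either is hypothetical.

```r
df1 <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when either string literally contains the other.
# fixed = TRUE turns off regex interpretation, so "(" and ")" are plain characters.
contains_either <- function(a, b) {
  mapply(
    function(x, y) grepl(y, x, fixed = TRUE) || grepl(x, y, fixed = TRUE),
    a, b, USE.NAMES = FALSE
  )
}

# Cross join: every row of df1 paired with every row of df2.
pairs <- merge(df1, df2, by = NULL, suffixes = c(".x", ".y"))

# Keep only the pairs whose Country values match.
matched <- pairs[contains_either(pairs$Country.x, pairs$Country.y), ]
matched
```

The cross join grows with nrow(df1) * nrow(df2), so this is only reasonable for small tables. The comparison is also case-sensitive, so lower-casing both columns first may be worth considering.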
Upvotes: 0
Reputation: 2770
You can merge on the first five letters of each country name. You will need to install the stringr package.
Replicating your data:
a <- data.frame(Continent = c("Europe", "Asia", "africa", "africa", "africa"), Country = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"))
b <- data.frame(Continent = c("Europe", "Asia", "africa", "africa", "africa"), Country = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"))
Create a key variable holding the first five letters of each country name, in lower case:
a$key <- stringr::str_extract(tolower(a$Country), "\\b[a-z]{0,5}")
b$key <- stringr::str_extract(tolower(b$Country), "\\b[a-z]{0,5}")
Then merge on the new key (you will probably want to rename your columns before this merge):
merge(a, b, by = "key")
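As a rough illustration of the renaming point, here is a base R sketch that builds a similar key with substr instead of stringr and passes suffixes to merge so the two Country columns stay distinguishable. The helper name first_five and the suffixes ".df1" and ".df2" are assumptions made for this example. On this data it produces the same keys as the regex version, but it takes the first five characters of any kind, not only letters.

```r
a <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case the name and keep its first five characters.
first_five <- function(x) substr(tolower(x), 1, 5)

a$key <- first_five(a$Country)
b$key <- first_five(b$Country)

# suffixes labels the clashing column names instead of the default .x / .y.
merged <- merge(a, b, by = "key", suffixes = c(".df1", ".df2"))

# Short prefix keys can collide, so list any key that appears more than once.
collisions <- a$key[duplicated(a$key)]

merged
```

A prefix key is a blunt instrument: "Niger" and "Nigeria" would both get the key "niger" and be matched to each other, so checking for duplicated keys before trusting the merge seems prudent.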
Upvotes: 2