Merging data frames with partial matching using R

Let's say I have a data frame df1 with the following variables:

  Continent         Country
1    Europe          Russia
2      Asia Myanmar (Burma)
3    africa           Benin
4    africa        Botswana
5    africa         Burkina

and a data frame df2 with the following variables:

  Continent            Country
1    Europe Russian Federation
2      Asia            Myanmar
3    africa          Benin,new
4    africa           Botswana
5    africa            Burkina

How do I combine the two data frames by Country using partial matching?

Upvotes: 0

Views: 284

Answers (2)

Ben

Reputation: 30474

It might be helpful to know what your final/desired data frame would look like.

You could consider the fuzzyjoin package in merging these two data frames. One approach would be to use str_detect and see if one Country string is contained in the other.

library(tidyverse)
library(fuzzyjoin)

# TRUE when either Country string contains the other (each string is used as a regex pattern)
mf <- function(a, b) str_detect(a, b) | str_detect(b, a)

fuzzy_semi_join(df1, df2, by = "Country", match_fun = mf)

  Continent         Country
1    Europe          Russia
2      Asia Myanmar (Burma)
3    Africa           Benin
4    Africa        Botswana
5    Africa         Burkina

An inner join will show you how the rows are matched (keeping both Country columns for comparison):

fuzzy_inner_join(df1, df2, by = "Country", match_fun = mf)

  Continent.x       Country.x Continent.y          Country.y
1      Europe          Russia      Europe Russian Federation
2        Asia Myanmar (Burma)        Asia            Myanmar
3      Africa           Benin      Africa          Benin,new
4      Africa        Botswana      Africa           Botswana
5      Africa         Burkina      Africa            Burkina
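One caveat with the match function above: str_detect treats its second argument as a regular expression, so characters such as the parentheses in "Myanmar (Burma)" are regex syntax rather than literal text. That does not change the result for this data, but it could for other names. Below is a minimal base R sketch of the same "one string contains the other" test done literally. The helper name contains_either is made up for illustration, the data frames are rebuilt from the question, and the all-pairs comparison is written out by hand as a stand-in for what the fuzzy join does, not as its actual implementation.

```r
# Hypothetical helper (not part of fuzzyjoin or stringr): TRUE when either
# string contains the other, compared literally via fixed = TRUE.
contains_either <- function(a, b) {
  mapply(function(x, y) grepl(y, x, fixed = TRUE) || grepl(x, y, fixed = TRUE),
         a, b, USE.NAMES = FALSE)
}

# Data rebuilt from the question.
df1 <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country   = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country   = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)

# Compare every df1 row with every df2 row, then keep the pairs that match.
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))
hit   <- contains_either(df1$Country[pairs$i], df2$Country[pairs$j])
matched <- data.frame(
  Country.x = df1$Country[pairs$i[hit]],
  Country.y = df2$Country[pairs$j[hit]],
  stringsAsFactors = FALSE
)
matched
```

If you stay with fuzzyjoin, wrapping the pattern in stringr's fixed() inside the match function should give the same literal behaviour.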

Upvotes: 0

MatthewR

Reputation: 2770

You can merge on the first five letters of each country name. You will need to install the stringr package.

Replicating your data:

a <- data.frame(Continent = c("Europe", "Asia", "africa", "africa", "africa"), Country = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"))
b <- data.frame(Continent = c("Europe", "Asia", "africa", "africa", "africa"), Country = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"))

Create a key variable holding up to the first five letters of each country name, in lower case:

 a$key <- stringr::str_extract(tolower(a$Country), "\\b[a-z]{0,5}")
 b$key <- stringr::str_extract(tolower(b$Country), "\\b[a-z]{0,5}")
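One thing to watch with a five-letter key is that distinct countries can share a prefix, in which case the merge would pair them up. A minimal sketch of checking for such collisions first; the country names are illustrative and not from the question's data, and substr() is used as a base R stand-in for the str_extract() call.

```r
# Illustrative names only (not from the question): two of them start with "austr".
countries <- c("Austria", "Australia", "Benin")

# Base R stand-in for the key above: lower-case, then the first five characters.
key <- substr(tolower(countries), 1, 5)

# Keys that occur more than once would produce extra row combinations in merge().
collisions <- unique(key[duplicated(key)])
collisions
```

Here collisions holds "austr", so those two rows would need a longer key or a manual fix before merging.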

Then merge on the new key (you will probably want to rename your columns before this merge, since both data frames use the same column names):

merge(a, b, by = "key")
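On the renaming point, merge() also takes a suffixes argument, which may save renaming the columns by hand. A sketch under a few assumptions: the data is rebuilt from the question, the suffix names ".df1" and ".df2" are arbitrary, and the key is built with base R's sub() as an approximation of the str_extract() call so the block runs without stringr (make_key is a made-up helper name).

```r
a <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country   = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country   = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: up to five leading lower-case letters of each name.
make_key <- function(x) sub("^([a-z]{0,5}).*$", "\\1", tolower(x))
a$key <- make_key(a$Country)
b$key <- make_key(b$Country)

# suffixes labels the columns whose names appear in both data frames.
merged <- merge(a, b, by = "key", suffixes = c(".df1", ".df2"))
names(merged)
```

Adding all = TRUE to the merge() call would also keep rows whose key has no partner in the other data frame, which can help spot names the key failed to match.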

Upvotes: 2
