tchoup
tchoup

Reputation: 1023

test if words are in a string (grepl, fuzzyjoin?)

I need to do a match and join on two data frames if the string from two columns of one data frame are contained in the string of a column from a second data frame.

Example dataframe:

First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane  l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID    <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")

df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)

So basically, I have df1 which has fairly orderly names of people in first and last name; I have df2, which has names which may be organized as "First Name, Last Name", or "Last Name First Name" or "First Name MI Last Name" or something else entirely that also contains the name. I need the ID column from df2. So I want to run a code to see if df1$First and df2$Last is somewhere in the string of df2$Name, and if it is have it pull and join df2$ID to df1.

My R guru told me to use fuzzy_left_join from the fuzzyjoin package:

fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")

but it gives me an error where the argument is not logical; and I can't figure out how to rewrite it to do what I want; the documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than df1$First and df1$Last. I think I might be able to use the grepl, but not sure how based on examples I've seen. Any advice?

Upvotes: 2

Views: 448

Answers (1)

Gregor Thomas
Gregor Thomas

Reputation: 145765

The documentation says that match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." It's not TRUE or FALSE, it's a function that returns TRUE or FALSE. If we switch your order, we can use stringr::str_detect, which does return TRUE or FALSE as required.

fuzzyjoin::fuzzy_left_join(
  df2, df1,
  by = c("Name" = "First", "Name" = "Last"),
  match_fun = stringr::str_detect
)
#                                  Name  ID First    Last
# 1                       mr john smith ID1  john   smith
# 2                                     ID2  <NA>    <NA>
# 3                     timothy t mcgee ID3  <NA>    <NA>
# 4                         dinnozo tom ID4   tom dinnozo
# 5                         jane  l doe ID5  jane     doe
# 6                         jimmy mcgee ID6 jimmy   mcgee
# 7 leah elizabeth arthur palmer and co ID7  leah  palmer
# 8                jerry bishop the cat ID8 jerry  bishop

Upvotes: 3

Related Questions