Reputation: 1023
I need to do a match and join on two data frames if the string from two columns of one data frame are contained in the string of a column from a second data frame.
Example dataframe:
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)
So basically, I have df1
which has fairly orderly names of people in first and last name; I have df2
, which has names which may be organized as "First Name, Last Name", or "Last Name First Name" or "First Name MI Last Name" or something else entirely that also contains the name. I need the ID
column from df2
. So I want to run a code to see if df1$First
and df2$Last
is somewhere in the string of df2$Name
, and if it is have it pull and join df2$ID
to df1
.
My R guru told me to use fuzzy_left_join
from the fuzzyjoin
package:
fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")
but it gives me an error where the argument is not logical; and I can't figure out how to rewrite it to do what I want; the documentation says that match_fun
should be TRUE
or FALSE
, but I don't know what to do with that. Also, it only matches on df1$First
rather than df1$First
and df1$Last
. I think I might be able to use the grepl, but not sure how based on examples I've seen. Any advice?
Upvotes: 2
Views: 448
Reputation: 145765
The documentation says that match_fun
should be a "Vectorized function given two columns, returning TRUE
or FALSE
as to whether they are a match." It's not TRUE or FALSE, it's a function that returns TRUE
or FALSE
. If we switch your order, we can use stringr::str_detect
, which does return TRUE
or FALSE
as required.
fuzzyjoin::fuzzy_left_join(
df2, df1,
by = c("Name" = "First", "Name" = "Last"),
match_fun = stringr::str_detect
)
# Name ID First Last
# 1 mr john smith ID1 john smith
# 2 ID2 <NA> <NA>
# 3 timothy t mcgee ID3 <NA> <NA>
# 4 dinnozo tom ID4 tom dinnozo
# 5 jane l doe ID5 jane doe
# 6 jimmy mcgee ID6 jimmy mcgee
# 7 leah elizabeth arthur palmer and co ID7 leah palmer
# 8 jerry bishop the cat ID8 jerry bishop
Upvotes: 3