Reputation: 393
I have to compare two people datasets between each other.
Let's say i have a data frame with few columns a =
.
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
1 | Mattias Adams | M | UK
2 | James Alan | M | Canada
3 | Dana Benton | F | USA
4 | Ella Collins | F | USA
And b =
ID | First_Name | Last_name | Third_name | Whole_name | Gender
————————————————————————————————————————————————————————————————————————————
1 | Gary | Cole | Allan | Gary Allan Cole | M
2 | Dana | Benton | NA | Dana Benton | F
3 | Lena | Jamison | Anne | Lena Anne Jamison | F
4 | Matt | King | NA | Matt King | M
Data frame a
is the bigger one contains around 100,000 rows while b
contains less than 1,000.
The goal is to use the data in b
to find matching records in a
. So that the whole row in a
is returned if there is a match.
I want to try two ways. First to find exact matches from b$"Whole_name"
in a$"Name"
.
Exact matching:
eue_wn <- as.character(b$"Whole_name")
eue_wn_match <- a[which(as.character(a$"Name") %in% eue_wn),]
if (nrow(eue_wn_match) == 0) {
eue_wn_match <- "No matches"
}
Output of eue_wn_matc
in this case would be:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
Pattern matching:
eup_ln <- paste(as.character(b$"Last_name"), collapse = "|")
eup_fn <- paste(as.character(b$"First_Name"), collapse = "|")
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")
eup_match <- a[which(grepl(eup_ln, as.character(a$"Name"), ignore.case = TRUE)),] #First filter (last name)
if (nrow(eup_match) == 0) {
eup_match <- "No matches"
}
if (nrow(eup_match) > 0) {
eup_match2 <- eup_match[which(grepl(eup_fn, as.character(eup_match$"Name"), ignore.case = TRUE)),] #Second filter (first name)
if (nrow(eup_match2) == 0 ) {
eup_match2 <- "No matches"
}
}
if (nrow(eup_match2) > 0) {
eup_match3 <- eup_match2[which(grepl(eup_tn, as.character(eup_match2$"Name"), ignore.case = TRUE)),] #Third filter (third_name)
if (nrow(eup_match3) == 0 ) {
eup_match3 <- "No matches"
}
}
So in this process the matching takes place in 3 stages.
First eup_match
is the result of finding the last name. Than it takes that result and looks for second match which is the first name name results eup_match2
shows record that matches both conditions. Finally the last result is taken and is being matched also with the third name eup_match3
in this case the result of all three of them is the same:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
And that is incorrect. Only eup_match
and eup_match2
should have that output. Since in the first stage we were matching Dana Benton(a)
and Dana(b)
In the next stage the match was Dana Benton(a)
and Benton (b)
. And since she does not have a third name it is impossible to match her with third name.
The problem is in:
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")
The output off this is :
"Allan|NA|Anne|NA"
Because the NA was converted to character the function was able to find pattern in a and b. In this particular case Dana Benson (a
) and NA (b)
Any idea on how to correct that ?
Another question is related to the output. Is there any way to output both results from a
and b
Example: if we are only matching the a$Name
with b$First_Name
by patterns the result would be
ID | Name | Gender | Country | Match | Match ID
———————————————————————————————————————————————————————————————————————————
1 | Mattias Adams | M | UK | Matt | 4
3 | Dana Benton | F | USA | Dana | 2
So that the first 4 columns are from the data set a
and the last two from b
Columns Match | Match ID
would show based on what were the records in b
matched.
The desired output for the test example given would be:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
Sorry for the long post. I tried to make it as clear as possible. If anyone would like to recreate this, xlsx files a
and b
as well as the r code can be found here: MyDropbox
If anyone has other suggestions on how to approach this topic is welcome to present them. Thank you for the help.
Upvotes: 0
Views: 198
Reputation: 4999
Why not something along the lines of
library(stringr)
library(dplyr)
a <- a %>%
# Extract first and last names into new variables
mutate(First_Name = str_extract(Name, "^[A-z]+"),
Last_Name = str_extract(Name, "[A-z]+$"),)
# Inner Join by first and last name.
# Add a suffix to be able to distinguish the origin of columns.
b %>% inner_join(a, by = c("First_Name", "Last_Name"), suffix = c(".b", ".a")) %>%
# Select the columns you want to see.
# Note that only the colums that have an ambiguous name have a suffix.
select(ID.a, Name, Gender.a, Country, First_Name, Last_Name, ID.b)
Works great if you are looking for exact matches only. If you like, you can also extract the middle name from a string via str_extract(string, "[^A-z]+[A-z]+[^A-z$]")
.
ID.a Name Gender.a Country First_Name Last_Name ID.b
1 3 Dana Benton F USA Dana Benton 2
Expanding from this great post:
library(RecordLinkage)
library(dplyr)
lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)
lookup %>% group_by(target) %>%
mutate(match_score = jarowinkler(target, source)) %>%
summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)]) %>%
inner_join(b, c("matched_to" = "Whole_Name"))
Anything above .8 or .9 should be a good match. Still not perfect. You could try to match first and last name separately, if your data is clean.
Result:# A tibble: 4 x 8
target match matched_to ID First_Name Last_Name Third_Name Gender
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 Dana Benton 1 Dana Benton 2 Dana Benton NA F
2 Ella Collins 0.593 Matt King 4 Matt King NA M
3 James Alan 0.667 Gary Allan Cole 1 Gary Cole Allan M
4 Mattias Adams 0.792 Matt King 4 Matt King NA M
Same as above, just using the Levenshtein distance and which.min()
library(RecordLinkage)
library(dplyr)
lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)
lookup %>% group_by(target) %>%
mutate(match_score = levenshteinDist(target, source)) %>%
summarise(match = match_score[which.min(match_score)], matched_to = ref[which.min(match_score)]) %>%
inner_join(b, c("matched_to" = "Whole_Name"))
As expected, this gives a poorer performance than JW.
Result:# A tibble: 4 x 8
target match matched_to ID First_Name Last_Name Third_Name Gender
<chr> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
1 Dana Benton 0 Dana Benton 2 Dana Benton NA F
2 Ella Collins 9 Dana Benton 2 Dana Benton NA F
3 James Alan 8 Matt King 4 Matt King NA M
4 Mattias Adams 8 Matt King 4 Matt King NA M
a <- structure(list(ID = c(1, 2, 3, 4), Name = c("Mattias Adams", "James Alan", "Dana Benton", "Ella Collins"), Gender = c("M", "M", "F", "F"), Country = c("UK", "Canada", "USA", "USA")), .Names = c("ID", "Name", "Gender", "Country"), row.names = c(NA, -4L), class = "data.frame")
b <- structure(list(ID = c(1, 2, 3, 4), First_Name = c("Gary", "Dana", "Lena", "Matt"), Last_name = c("Cole", "Benton", "Jamison", "King"), Third_Name = c("Allan", "NA", "Anne", "NA"), Whole_name = c("Gary Allan Cole", "Dana Benton", "Lena Anne Jamison", "Matt King"), Gender = c("M", "F", "F", "M")), .Names = c("ID", "First_Name", "Last_Name", "Third_Name", "Whole_Name", "Gender"), row.names = c(NA, -4L), class = "data.frame")
Upvotes: 1
Reputation: 44987
If you want to avoid the false matches to NA
, don't include it in the pattern. Use this instead:
eup_tn <- paste(na.omit(as.character(b$"Third_name")), collapse = "|")
As to your second question: that's done using the merge()
function in base R, or one of the replacements for it in ?dplyr::join
, probably inner_join()
.
Upvotes: 0