jerry_k
jerry_k

Reputation: 393

Cross comparison of data frames in R

I have to compare two people datasets between each other.

Let's say i have a data frame with few columns a =.

    ID  |     Name        |      Gender      |  Country   
   ——————————————————————————————————————————————————————————
    1   | Mattias Adams   |        M         |    UK           
    2   | James Alan      |        M         |    Canada
    3   | Dana Benton     |        F         |    USA
    4   | Ella Collins    |        F         |    USA

And b =

    ID  | First_Name | Last_name  | Third_name |  Whole_name       |  Gender   
————————————————————————————————————————————————————————————————————————————
    1   |    Gary    |  Cole      |    Allan   | Gary Allan Cole   |    M
    2   |    Dana    |  Benton    |    NA      | Dana Benton       |    F
    3   |    Lena    |  Jamison   |    Anne    | Lena Anne Jamison |    F        
    4   |    Matt    |  King      |    NA      | Matt King         |    M

Data frame a is the bigger one contains around 100,000 rows while b contains less than 1,000.

The goal is to use the data in b to find matching records in a. So that the whole row in a is returned if there is a match.

I want to try two ways. First to find exact matches from b$"Whole_name" in a$"Name".

Exact matching:

    eue_wn <- as.character(b$"Whole_name")
    eue_wn_match <- a[which(as.character(a$"Name") %in% eue_wn),]
        if (nrow(eue_wn_match) == 0) {
             eue_wn_match <- "No matches"
        }

Output of eue_wn_matc in this case would be:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

Pattern matching:

    eup_ln <- paste(as.character(b$"Last_name"), collapse = "|")
    eup_fn <- paste(as.character(b$"First_Name"), collapse = "|")
    eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")

    eup_match <- a[which(grepl(eup_ln, as.character(a$"Name"), ignore.case = TRUE)),]         #First filter (last name)
    if (nrow(eup_match) == 0) {
        eup_match <- "No matches"
    } 
    if (nrow(eup_match) > 0) {
        eup_match2 <- eup_match[which(grepl(eup_fn, as.character(eup_match$"Name"), ignore.case = TRUE)),]     #Second filter (first name)
          if (nrow(eup_match2) == 0 ) {
            eup_match2 <- "No matches"
          }
    }

    if (nrow(eup_match2) > 0) {
      eup_match3 <- eup_match2[which(grepl(eup_tn, as.character(eup_match2$"Name"), ignore.case = TRUE)),]     #Third filter (third_name)
      if (nrow(eup_match3) == 0 ) {
        eup_match3 <- "No matches"
      }
    }

So in this process the matching takes place in 3 stages. First eup_match is the result of finding the last name. Than it takes that result and looks for second match which is the first name name results eup_match2 shows record that matches both conditions. Finally the last result is taken and is being matched also with the third name eup_match3

in this case the result of all three of them is the same:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

And that is incorrect. Only eup_match and eup_match2 should have that output. Since in the first stage we were matching Dana Benton(a) and Dana(b) In the next stage the match was Dana Benton(a) and Benton (b). And since she does not have a third name it is impossible to match her with third name. The problem is in:

eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")

The output off this is :

"Allan|NA|Anne|NA"

Because the NA was converted to character the function was able to find pattern in a and b. In this particular case Dana Benson (a) and NA (b)

Any idea on how to correct that ?

Another question is related to the output. Is there any way to output both results from a and b

Example: if we are only matching the a$Name with b$First_Name by patterns the result would be

ID  |     Name        |      Gender      |  Country   | Match | Match ID
———————————————————————————————————————————————————————————————————————————
1   | Mattias Adams   |        M         |    UK      | Matt  |    4 
3   | Dana Benton     |        F         |    USA     | Dana  |    2

So that the first 4 columns are from the data set a and the last two from b Columns Match | Match ID would show based on what were the records in b matched.

The desired output for the test example given would be:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

Sorry for the long post. I tried to make it as clear as possible. If anyone would like to recreate this, xlsx files a and b as well as the r code can be found here: MyDropbox

If anyone has other suggestions on how to approach this topic is welcome to present them. Thank you for the help.

Upvotes: 0

Views: 198

Answers (2)

Roman
Roman

Reputation: 4999

Approach #1: Exact match

Why not something along the lines of

library(stringr)
library(dplyr)
a <- a %>%
    # Extract first and last names into new variables
    mutate(First_Name = str_extract(Name, "^[A-z]+"),
           Last_Name = str_extract(Name, "[A-z]+$"),)

# Inner Join by first and last name.
# Add a suffix to be able to distinguish the origin of columns.
b %>% inner_join(a, by = c("First_Name", "Last_Name"), suffix = c(".b", ".a")) %>%
    # Select the columns you want to see.
    # Note that only the colums that have an ambiguous name have a suffix.
    select(ID.a, Name, Gender.a, Country, First_Name, Last_Name, ID.b)

Works great if you are looking for exact matches only. If you like, you can also extract the middle name from a string via str_extract(string, "[^A-z]+[A-z]+[^A-z$]").

Result:
  ID.a        Name Gender.a Country First_Name Last_Name ID.b
1    3 Dana Benton        F     USA       Dana    Benton    2

Approach #2: Word distance (Jaro-Winkler)

Expanding from this great post:

library(RecordLinkage)
library(dplyr)

lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)

lookup %>% group_by(target) %>%
    mutate(match_score = jarowinkler(target, source)) %>%
    summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)]) %>%
    inner_join(b, c("matched_to" = "Whole_Name"))

Anything above .8 or .9 should be a good match. Still not perfect. You could try to match first and last name separately, if your data is clean.

Result:
# A tibble: 4 x 8
  target        match matched_to         ID First_Name Last_Name Third_Name Gender
  <chr>         <dbl> <chr>           <dbl> <chr>      <chr>     <chr>      <chr> 
1 Dana Benton   1     Dana Benton         2 Dana       Benton    NA         F     
2 Ella Collins  0.593 Matt King           4 Matt       King      NA         M     
3 James Alan    0.667 Gary Allan Cole     1 Gary       Cole      Allan      M     
4 Mattias Adams 0.792 Matt King           4 Matt       King      NA         M     


Approach #3: Word distance (Levenshtein)

Same as above, just using the Levenshtein distance and which.min()

library(RecordLinkage)
library(dplyr)


lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)

lookup %>% group_by(target) %>%
    mutate(match_score = levenshteinDist(target, source)) %>%
    summarise(match = match_score[which.min(match_score)], matched_to = ref[which.min(match_score)]) %>%
    inner_join(b, c("matched_to" = "Whole_Name"))

As expected, this gives a poorer performance than JW.

Result:
# A tibble: 4 x 8
  target        match matched_to     ID First_Name Last_Name Third_Name Gender
  <chr>         <int> <chr>       <dbl> <chr>      <chr>     <chr>      <chr> 
1 Dana Benton       0 Dana Benton     2 Dana       Benton    NA         F     
2 Ella Collins      9 Dana Benton     2 Dana       Benton    NA         F     
3 James Alan        8 Matt King       4 Matt       King      NA         M     
4 Mattias Adams     8 Matt King       4 Matt       King      NA         M     


Data

a <- structure(list(ID = c(1, 2, 3, 4), Name = c("Mattias Adams", "James Alan", "Dana Benton", "Ella Collins"), Gender = c("M", "M", "F", "F"), Country = c("UK", "Canada", "USA", "USA")), .Names = c("ID", "Name", "Gender", "Country"), row.names = c(NA, -4L), class = "data.frame")
b <- structure(list(ID = c(1, 2, 3, 4), First_Name = c("Gary", "Dana", "Lena", "Matt"), Last_name = c("Cole", "Benton", "Jamison", "King"), Third_Name = c("Allan", "NA", "Anne", "NA"), Whole_name = c("Gary Allan Cole", "Dana Benton", "Lena Anne Jamison", "Matt King"), Gender = c("M", "F", "F", "M")), .Names = c("ID", "First_Name", "Last_Name", "Third_Name", "Whole_Name", "Gender"), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 1

user2554330
user2554330

Reputation: 44987

If you want to avoid the false matches to NA, don't include it in the pattern. Use this instead:

eup_tn <- paste(na.omit(as.character(b$"Third_name")), collapse = "|")

As to your second question: that's done using the merge() function in base R, or one of the replacements for it in ?dplyr::join, probably inner_join().

Upvotes: 0

Related Questions