Mishalb
Mishalb

Reputation: 173

Fuzzy Join with Partial String Match in R

I came across something new that I've never done before and I'm hoping to get some help. I'm trying to join two data sets (easy enough job), but there are only partial-string matches in two of the columns. I tried to use fuzzy_join, but I can't get it to work for me. Below is what I'm trying to do. I'm hoping to end up with the data frame called df_final. Any ideas?

df1 <- tribble(
  ~student_id, ~course, ~grade, 
  "001", "social studies grade", "A", 
  "001", "ela grade", "A", 
  "001", "math grade", "A", 
  "002", "social studies grade", "B", 
  "002", "ela grade", "B", 
  "002", "math grade", "B", 
  "003", "social studies grade", "C", 
  "003", "ela grade", "C", 
  "003", "math grade", "C", 
  "004", "social studies grade", "C", 
  "004", "ela grade", "C", 
  "004", "math grade", "C", 
  "005", "social studies grade", "C", 
  "005", "ela grade", "C", 
  "005", "math grade", "C", 
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th ELA",
  "001", "5th Mathematics",
  "002", "6th Social Studies", 
  "002", "6th ELA",
  "002", "6th Mathematics",
  "003", "8th Social Studies",
  "003", "8th ELA",
  "003", "8th Mathematics",
)

df_final <- tribble(
  ~student_id, ~course, ~grade,
  "001", "5th Social Studies", "A",
  "001", "5th ELA", "A",
  "001", "5th Mathematics", "A",
  "002", "6th Social Studies", "B",
  "002", "6th ELA", "B",
  "002", "6th Mathematics", "B",
  "003", "8th Social Studies", "C",
  "003", "8th ELA", "C",
  "003", "8th Mathematics", "C"
)

Upvotes: 5

Views: 2664

Answers (2)

Konrad
Konrad

Reputation: 18585

I would suggest exploring possibility of using commonly accepted ways of comparing string distances, in my view this constitutes the actual fuzzy approach to matching. Regex-based matching is simply deterministic matching (equally, you could transform your columns first and then do the left join). The solution below uses comparator package to derive matching function returning Jaro-Winkler string distance and undertakes "fuzzy" matching. In your particular example regex may be more appropriate, but in NLP the actual "fuzziness" is usually defined in the context of string distances or a more sophisticated NLP-processing, thing of a situation where you may wish to account for common spelling mistakes, etc. Encapsulating that in a regex will prove onerous.

jw <- comparator::JaroWinkler()

fuzzyjoin::fuzzy_left_join(
    x = df1, y = df2, by = "course",
    match_fun = function(x, y) { jw(x, y) > 0.1}
) 

Example

Your regex will fail miserably if the world grade is mistyped as grade but JaroWinkler string distance will show strong proximity:

>> jw("grade", "grde")
[1] 0.9466667

That's robust if you are dealing with diverse data set where you can't rely on a column having a well-defined pattern.

Upvotes: 5

akrun
akrun

Reputation: 887118

We can use fuzzyjoin. Do a regex_left_join after getting the substring from the 'course' columns in both dataset (to make it more matchable)

library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>% 
   mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>% 
   regex_left_join(df1 %>%
       mutate(grp = toupper(str_remove(course, 
     "\\s+grade$")), course = NULL), by = c('student_id', "grp")) %>% 
   select(student_id = student_id.x, course, grade)

-output

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C    

OP's expected output is

 df_final
# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C    

Upvotes: 2

Related Questions