Fuzzy Join with Partial String Match in R

Question

I came across something new that I've never done before and I'm hoping to get some help. I'm trying to join two data sets (easy enough job), but there are only partial-string matches in two of the columns. I tried to use fuzzy_join, but I can't get it to work for me. Below is what I'm trying to do. I'm hoping to end up with the data frame called df_final. Any ideas?

df1 <- tribble(
  ~student_id, ~course, ~grade, 
  "001", "social studies grade", "A", 
  "001", "ela grade", "A", 
  "001", "math grade", "A", 
  "002", "social studies grade", "B", 
  "002", "ela grade", "B", 
  "002", "math grade", "B", 
  "003", "social studies grade", "C", 
  "003", "ela grade", "C", 
  "003", "math grade", "C", 
  "004", "social studies grade", "C", 
  "004", "ela grade", "C", 
  "004", "math grade", "C", 
  "005", "social studies grade", "C", 
  "005", "ela grade", "C", 
  "005", "math grade", "C", 
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th ELA",
  "001", "5th Mathematics",
  "002", "6th Social Studies", 
  "002", "6th ELA",
  "002", "6th Mathematics",
  "003", "8th Social Studies",
  "003", "8th ELA",
  "003", "8th Mathematics",
)

df_final <- tribble(
  ~student_id, ~course, ~grade,
  "001", "5th Social Studies", "A",
  "001", "5th ELA", "A",
  "001", "5th Mathematics", "A",
  "002", "6th Social Studies", "B",
  "002", "6th ELA", "B",
  "002", "6th Mathematics", "B",
  "003", "8th Social Studies", "C",
  "003", "8th ELA", "C",
  "003", "8th Mathematics", "C"
)

akrun · Accepted Answer

We can use fuzzyjoin. Do a regex_left_join after getting the substring from the 'course' columns in both dataset (to make it more matchable)

library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>% 
   mutate(grp = toupper(str_remove(course, "^\d+th\s+"))) %>% 
   regex_left_join(df1 %>%
       mutate(grp = toupper(str_remove(course, 
     "\s+grade$")), course = NULL), by = c('student_id', "grp")) %>% 
   select(student_id = student_id.x, course, grade)

-output

# A tibble: 9 x 3
  student_id course             grade
                      
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C

OP's expected output is

 df_final
# A tibble: 9 x 3
  student_id course             grade
                      
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C

Fuzzy Join with Partial String Match in R

Answers (2)

Example

Related Questions