Reputation: 173
I came across something new that I've never done before and I'm hoping to get some help. I'm trying to join two data sets (easy enough job), but there are only partial-string matches in two of the columns. I tried to use fuzzy_join, but I can't get it to work for me. Below is what I'm trying to do. I'm hoping to end up with the data frame called df_final. Any ideas?
df1 <- tribble(
~student_id, ~course, ~grade,
"001", "social studies grade", "A",
"001", "ela grade", "A",
"001", "math grade", "A",
"002", "social studies grade", "B",
"002", "ela grade", "B",
"002", "math grade", "B",
"003", "social studies grade", "C",
"003", "ela grade", "C",
"003", "math grade", "C",
"004", "social studies grade", "C",
"004", "ela grade", "C",
"004", "math grade", "C",
"005", "social studies grade", "C",
"005", "ela grade", "C",
"005", "math grade", "C",
)
df2 <- tribble(
~student_id, ~course,
"001", "5th Social Studies",
"001", "5th ELA",
"001", "5th Mathematics",
"002", "6th Social Studies",
"002", "6th ELA",
"002", "6th Mathematics",
"003", "8th Social Studies",
"003", "8th ELA",
"003", "8th Mathematics",
)
df_final <- tribble(
~student_id, ~course, ~grade,
"001", "5th Social Studies", "A",
"001", "5th ELA", "A",
"001", "5th Mathematics", "A",
"002", "6th Social Studies", "B",
"002", "6th ELA", "B",
"002", "6th Mathematics", "B",
"003", "8th Social Studies", "C",
"003", "8th ELA", "C",
"003", "8th Mathematics", "C"
)
Upvotes: 5
Views: 2664
Reputation: 18585
I would suggest exploring possibility of using commonly accepted ways of comparing string distances, in my view this constitutes the actual fuzzy approach to matching. Regex-based matching is simply deterministic matching (equally, you could transform your columns first and then do the left join). The solution below uses comparator
package to derive matching function returning Jaro-Winkler string distance and undertakes "fuzzy" matching. In your particular example regex may be more appropriate, but in NLP the actual "fuzziness" is usually defined in the context of string distances or a more sophisticated NLP-processing, thing of a situation where you may wish to account for common spelling mistakes, etc. Encapsulating that in a regex will prove onerous.
jw <- comparator::JaroWinkler()
fuzzyjoin::fuzzy_left_join(
x = df1, y = df2, by = "course",
match_fun = function(x, y) { jw(x, y) > 0.1}
)
Your regex will fail miserably if the world grade
is mistyped as grade
but JaroWinkler
string distance will show strong proximity:
>> jw("grade", "grde")
[1] 0.9466667
That's robust if you are dealing with diverse data set where you can't rely on a column having a well-defined pattern.
Upvotes: 5
Reputation: 887118
We can use fuzzyjoin
. Do a regex_left_join
after getting the substring from the 'course' columns in both dataset (to make it more matchable)
library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>%
mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
regex_left_join(df1 %>%
mutate(grp = toupper(str_remove(course,
"\\s+grade$")), course = NULL), by = c('student_id', "grp")) %>%
select(student_id = student_id.x, course, grade)
-output
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
OP's expected output is
df_final
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
Upvotes: 2