Christopher Penn
Christopher Penn

Reputation: 539

Merging two dataframes by stringmatch with dplyr and stringdist

I'm attempting to do a dplyr left join on two dataframes based on greatly similar language (that's not exact).

DF1:

title | records
Bob's show, part 1 | 42
Time for dinner | 77
Horsecrap | 121

DF2:

showname | counts
Bob's show part 1 | 772
Dinner time | 89
No way Jose | 123

I execute this to get the string distances as a vector using the stringdist package/library:

titlematch <- amatch(df1$title,df2$showname) 

The vector looks like... well, an integer vector:

titlematch
1
2
NA

Normally if I had exact matches I'd do:

blended <- left_join(df1, df2, by = c("title" = "showname"))

How do I do the left join using the vector as the record selector so that the end outcome would be:

title | records | showname | counts
Bob's show, part 1 | 42 | Bob's show part 1 | 772
Time for dinner | 77 | Dinner time | 89

With the third non-match excluded because there's not a probable match in the vector (NA).

Upvotes: 2

Views: 720

Answers (2)

Christopher Penn
Christopher Penn

Reputation: 539

camille suggested in a comment:

Have you looked at fuzzyjoin?

I had never heard of fuzzyjoin before, but I tried it and loved it. stringdist_left_join was exactly what I needed.

Upvotes: 1

zack
zack

Reputation: 5405

Here's a shot,

library(stringdist)
library(tidyverse)

df1 %>%
  as_tibble() %>%
  mutate(temp = amatch(title, df2$showname, maxDist = 10)) %>%
  bind_cols(df2[.$temp, ]) %>%
  select(-temp)

# A tibble: 3 x 4
  title              records showname          counts
  <chr>                <int> <chr>              <int>
1 Bob's show, part 1      42 Bob's show part 1    772
2 Time for dinner         77 Dinner time           89
3 Horsecrap              121 Dinner time           89

I'm unable to reproduce your numeric matching vector, amatch(df1$title, df2$showname) gives me [1] NA NA NA as it looks like the default is 0.1, so I set maxDist to 10.

At the end, you could always add %>% filter(is.na(showname)) to remove any row which had no match.

data

df1 <- structure(list(title = c("Bob's show, part 1", "Time for dinner", 
"Horsecrap"), records = c(42L, 77L, 121L)), .Names = c("title", 
"records"), row.names = c(NA, -3L), class = "data.frame")

df2 <- structure(list(showname = c("Bob's show part 1", "Dinner time", 
"No way Jose"), counts = c(772L, 89L, 123L)), .Names = c("showname", 
"counts"), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 2

Related Questions