justing

Reputation: 29

Is there an R package (or existing function) for fuzzy string detection?

I'm looking for something similar to str_detect() from the stringr package, but capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).

The matching I'm doing will take a form similar to the code below (this is just a simplified example I made up). In the example, only "RUTH CHRIS" gets matched - I'd like something capable of matching the slightly wrong strings as well.

library(tidyverse)

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
    )) 

So again, this gives this output:

## A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST NA       
# 2 NEW JERSEY WENDYS          NA       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA 

But I want:

## A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP       
# 2 NEW JERSEY WENDYS          CHEAP       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE 

I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.

Upvotes: 2

Views: 209

Answers (3)

justing

Reputation: 29

The top response to this question clued me in to try agrepl(), which seems to best suit my needs for this project since it is a straightforward substitute for str_detect().

Using my example from above...

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    agrepl(cheap, restaurant, max.distance = 2, fixed = FALSE) ~ "CHEAP",
    agrepl(expensive, restaurant, max.distance = 2, fixed = FALSE) ~ "EXPENSIVE"
  ))

Gives the output:

# A tibble: 4 × 2
  restaurant                 category 
  <chr>                      <chr>    
1 MCDOlNALD'S ON FRANKLIN ST CHEAP    
2 NEW JERSEY WENDYS          CHEAP    
3 8/25/19 RUTH CHRIS         EXPENSIVE
4 MELTINGPO 9823i3           EXPENSIVE
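
To choose a max.distance value, it may help to look at the edit distances directly. This is a minimal sketch, assuming base R's adist() with partial = TRUE, which its documentation describes as the distance agrep() uses by default. The names patterns and restaurants are just illustrative.

```r
patterns <- c("MCDONALD'S", "WENDY'S", "RUTH CHRIS", "MELTING POT")
restaurants <- c("MCDOlNALD'S ON FRANKLIN ST",
                 "NEW JERSEY WENDYS",
                 "8/25/19 RUTH CHRIS",
                 "MELTINGPO 9823i3")

# Rows are patterns, columns are restaurants.
# partial = TRUE scores each pattern against its best-matching substring.
d <- adist(patterns, restaurants, partial = TRUE)

# Distance from each pattern to the restaurant it is meant to match
diag(d)

# Any pair with a distance of at most max.distance counts as a match
agrepl("MELTING POT", restaurants, max.distance = 2, fixed = TRUE)
```

The largest value on the diagonal is the smallest max.distance that still catches every intended match in this example.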

However, Onyambu's solutions also look like good alternatives. They allow for more advanced forms of fuzzy matching than agrepl() is capable of.

Upvotes: 0

Onyambu

Reputation: 79228

In base R, you could do:

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

transform(my_restaurants, category=pat[sapply(pat$values,agrep,restaurant),2])

                  restaurant  category
1 MCDOlNALD'S ON FRANKLIN ST     cheap
2          NEW JERSEY WENDYS     cheap
3         8/25/19 RUTH CHRIS expensive
4           MELTINGPO 9823i3 expensive
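
As far as I can tell, indexing pat by the row numbers that agrep() returns lines up only when the i-th pattern matches the i-th restaurant, as it does here. For data in arbitrary order, a more defensive variant might look like the sketch below. It assumes agrepl() with max.distance = 2 and takes the first matching pattern per row; the names hits and first_hit are illustrative.

```r
my_restaurants <- data.frame(
  restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                 "NEW JERSEY WENDYS",
                 "8/25/19 RUTH CHRIS",
                 "MELTINGPO 9823i3")
)

pat <- stack(list(cheap = c("MCDONALD'S", "WENDY'S"),
                  expensive = c("RUTH CHRIS", "MELTING POT")))

# Logical matrix: one row per restaurant, one column per pattern
# (assumes more than one restaurant, so sapply() returns a matrix)
hits <- sapply(pat$values, agrepl, x = my_restaurants$restaurant,
               max.distance = 2, fixed = TRUE)

# Index of the first matching pattern in each row, NA when nothing matches
first_hit <- apply(hits, 1, function(row) which(row)[1])

# Look up the category belonging to that pattern
my_restaurants$category <- as.character(pat$ind[first_hit])
my_restaurants
```

Rows that match no pattern get NA, which mirrors what case_when() does when no condition is met.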

Upvotes: 4

Onyambu

Reputation: 79228

You can use fuzzyjoin::stringdist_left_join():

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

fuzzyjoin::stringdist_left_join(my_restaurants, pat, 
      c(restaurant='values'), max_dist=0.45, method = 'jaccard')

# A tibble: 4 x 3
  restaurant                 values      ind      
  <chr>                      <chr>       <fct>    
1 MCDOlNALD'S ON FRANKLIN ST MCDONALD'S  cheap    
2 NEW JERSEY WENDYS          WENDY'S     cheap    
3 8/25/19 RUTH CHRIS         RUTH CHRIS  expensive
4 MELTINGPO 9823i3           MELTING POT expensive
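
The max_dist = 0.45 threshold is specific to this data. One way to choose it is to inspect the pairwise distances first. This is a sketch that assumes the stringdist package is installed (fuzzyjoin uses it to compute these distances) and that the join uses single-character q-grams (q = 1, the stringdist default).

```r
library(stringdist)

restaurants <- c("MCDOlNALD'S ON FRANKLIN ST",
                 "NEW JERSEY WENDYS",
                 "8/25/19 RUTH CHRIS",
                 "MELTINGPO 9823i3")
patterns <- c("MCDONALD'S", "WENDY'S", "RUTH CHRIS", "MELTING POT")

# Jaccard distance on sets of single characters:
# 1 - (shared characters / all distinct characters in either string)
d <- stringdistmatrix(restaurants, patterns, method = "jaccard", q = 1)
dimnames(d) <- list(restaurants, patterns)

# Rows are restaurants, columns are patterns;
# pairs at or below max_dist are the ones the join keeps
round(d, 2)
```

Because Jaccard distance ignores character order and repetition, long strings with many unrelated characters drift away from a short pattern, so the threshold usually needs tuning per dataset.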

Upvotes: 3
