lukitamodric

Reputation: 37

Calculate pairwise similarity of strings in column using tidyverse

If I have a df with the following columns and rows:

row text
1 This sentence is very similar to the next sentence
2 This sentence is not very similar to the next sentence
3 You can't sneeze with your eyes opened
... ...

How can I apply a function that checks, for every value in the text column, whether there is a similar sentence in another row of that column? I then want to remove rows whose text is too similar to that of another row. For example, how can I ensure that no string in the column is more than 30%, 40%, or 80% similar to another string in that same column?

What I want to end up with is the following:

row text
1 This sentence is very similar to the next sentence
3 You can't sneeze with your eyes opened
... ...

Upvotes: 0

Views: 924

Answers (2)

Allan Cameron

Reputation: 173813

Here's a solution based on the Levenshtein similarity between strings, computed with the levenshteinSim function from the RecordLinkage package, which is reasonably fast:

library(RecordLinkage)

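exclude_similar <- function(text, similarity = 0.8) {
  # Pairwise similarity matrix, split into a list with one vector per row
  sim_mat <- asplit(outer(text, text, levenshteinSim), 1)
  # For each string, collect the later strings that are too similar to it
  exclude <- unlist(lapply(seq_along(sim_mat), function(x) {
    y <- which(sim_mat[[x]] > similarity)
    y[y > x]
  }))
  # TRUE for rows to keep, FALSE for rows to drop
  answer <- rep(TRUE, length(text))
  answer[exclude] <- FALSE
  return(answer)
}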

You would use the function like this:

df[exclude_similar(df$text, similarity = 0.8), ]
#>   row                                               text
#> 1   1 This sentence is very similar to the next sentence
#> 3   3             You can't sneeze with your eyes opened

df[exclude_similar(df$text, similarity = 0.1), ]
#>   row                                               text
#> 1   1 This sentence is very similar to the next sentence

df[exclude_similar(df$text, similarity = 0.95), ]
#>   row                                                   text
#> 1   1     This sentence is very similar to the next sentence
#> 2   2 This sentence is not very similar to the next sentence
#> 3   3                 You can't sneeze with your eyes opened

Created on 2022-02-07 by the reprex package (v2.0.1)


Data used

df <- read.table(text = "row    text
1   \"This sentence is very similar to the next sentence\"
2   \"This sentence is not very similar to the next sentence\"
3   \"You can't sneeze with your eyes opened\"", header = TRUE)
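
For reference, the similarity score can be reproduced in base R. The sketch below assumes levenshteinSim normalizes as 1 - distance / length of the longer string, which is how the RecordLinkage documentation describes it. lev_sim is a hypothetical helper name, and utils::adist stands in for the package's own distance routine.

```r
# Base R sketch: normalized Levenshtein similarity, 1 - distance / longer length.
# lev_sim is a hypothetical helper, not a RecordLinkage function.
lev_sim <- function(a, b) {
  d <- utils::adist(a, b)                     # matrix of edit distances
  longest <- outer(nchar(a), nchar(b), pmax)  # longer string length per pair
  1 - d / longest                             # NaN if both strings are empty
}

s1 <- "This sentence is very similar to the next sentence"
s2 <- "This sentence is not very similar to the next sentence"

# The strings differ by 4 inserted characters ("not "), and the longer one
# has 54 characters, so the similarity is 1 - 4/54.
lev_sim(s1, s2)
```

A threshold of 0.8 therefore treats these two sentences as near-duplicates, while 0.95 does not.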

Upvotes: 1

JBGruber

Reputation: 12420

Not the most elegant solution, and slow on a large data.frame, but you can use stringdist::stringsim. It compares strings and can return different similarity measures (see the method argument). So given your data:

df <- tibble::tribble(
  ~row, ~text,
  1,    "This sentence is very similar to the next sentence",
  2,    "This sentence is not very similar to the next sentence",
  3,    "You can't sneeze with your eyes opened"
)


stringdist::stringsim(df$text[1], df$text)
#> [1] 1.0000000 0.9259259 0.2800000

We can wrap this in a function to compare every text with all texts that came before and return a logical vector.

library(dplyr)
find_dup <- function(string, thres) {
  purrr::map_lgl(seq_along(string), function(i) {
    # Compare string i with every earlier string (none when i is 1)
    sim <- stringdist::stringsim(string[i], string[seq_len(i - 1)])
    any(sim > thres)
  })
}

Using mutate() you can check that the result is correct, then remove the near-duplicate entries with filter():

df %>% 
  mutate(dup = find_dup(text, 0.8))
#> # A tibble: 3 × 3
#>     row text                                                   dup  
#>   <dbl> <chr>                                                  <lgl>
#> 1     1 This sentence is very similar to the next sentence     FALSE
#> 2     2 This sentence is not very similar to the next sentence TRUE 
#> 3     3 You can't sneeze with your eyes opened                 FALSE

df %>% 
  filter(!find_dup(text, 0.8))
#> # A tibble: 2 × 2
#>     row text                                              
#>   <dbl> <chr>                                             
#> 1     1 This sentence is very similar to the next sentence
#> 2     3 You can't sneeze with your eyes opened

Created on 2022-02-07 by the reprex package (v2.0.1)
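
One thing to be aware of: find_dup compares each string with all earlier strings, including ones that will themselves be dropped. If that is not what you want, a possible greedy variant compares each string only with the rows already kept. This is a sketch under assumptions, not part of the answer above: keep_dissimilar is a hypothetical name, and it uses base R's utils::adist with a 1 - distance / longer length normalization instead of stringdist.

```r
# Greedy variant (an assumption, not from the answers above): a row is dropped
# only if it is too similar to a row that has already been kept.
keep_dissimilar <- function(text, thres = 0.8) {
  d <- utils::adist(text)                               # pairwise edit distances
  sim <- 1 - d / outer(nchar(text), nchar(text), pmax)  # normalized similarity
  keep <- logical(length(text))                         # all FALSE to start
  for (i in seq_along(text)) {
    # keep[i] is still FALSE here, so row i is never compared with itself
    keep[i] <- !any(sim[i, keep] > thres, na.rm = TRUE)
  }
  keep
}

x <- c("abcd", "abcx", "abxx")
# "abcx" is dropped (similarity 0.75 to "abcd"); "abxx" is kept because its
# only close match, "abcx", is no longer in the result.
keep_dissimilar(x, thres = 0.7)
```

The returned logical vector marks rows to keep, so it can be passed to filter() directly, for example df %>% filter(keep_dissimilar(text, 0.8)).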

Upvotes: 2
