Reputation: 37
Suppose I have a data frame df with the following columns and rows:
row | text |
---|---|
1 | This sentence is very similar to the next sentence |
2 | This sentence is not very similar to the next sentence |
3 | You can't sneeze with your eyes opened |
... | ... |
How can I apply a function that checks, for every value in the text column, whether there is a similar sentence in another row of that column? I then want to remove the rows whose text value is too similar to another row's. For example, how can I ensure that no cell in the column is more than 30, 40, or 80% similar to another string in that same column?
What I want to end up with is the following:
row | text |
---|---|
1 | This sentence is very similar to the next sentence |
3 | You can't sneeze with your eyes opened |
... | ... |
Upvotes: 0
Views: 924
Reputation: 173813
Here's a solution based on the Levenshtein similarity between strings, computed with the levenshteinSim function from the RecordLinkage package, which is reasonably fast:
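library(RecordLinkage)
exclude_similar <- function(text, similarity = 0.8) {
  # Pairwise similarity matrix, split into one vector per row
  sim_mat <- asplit(outer(text, text, levenshteinSim), 1)
  # For each string, collect the later strings that are too similar to it
  exclude <- unlist(lapply(seq_along(sim_mat), function(x) {
    y <- which(sim_mat[[x]] > similarity)
    y[y > x]
  }))
  # TRUE = keep the row, FALSE = drop it
  answer <- rep(TRUE, length(text))
  answer[exclude] <- FALSE
  return(answer)
}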
You would use the function like this:
df[exclude_similar(df$text, similarity = 0.8), ]
#> row text
#> 1 1 This sentence is very similar to the next sentence
#> 3 3 You can't sneeze with your eyes opened
df[exclude_similar(df$text, similarity = 0.1), ]
#> row text
#> 1 1 This sentence is very similar to the next sentence
df[exclude_similar(df$text, similarity = 0.95), ]
#> row text
#> 1 1 This sentence is very similar to the next sentence
#> 2 2 This sentence is not very similar to the next sentence
#> 3 3 You can't sneeze with your eyes opened
Created on 2022-02-07 by the reprex package (v2.0.1)
Data used
df <- read.table(text = "row text
1 \"This sentence is very similar to the next sentence\"
2 \"This sentence is not very similar to the next sentence\"
3 \"You can't sneeze with your eyes opened\"", header = TRUE)
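To make the similarity score less of a black box: as far as I understand it, levenshteinSim returns 1 minus the Levenshtein distance divided by the length of the longer string. The sketch below reproduces that formula with base R's adist(), so it needs no extra packages. Treat the formula as my assumption about the package rather than its documented definition; dist_mat, max_len and sim_mat are illustrative names only.

```r
# Illustrative sketch in base R.
# Assumption: similarity = 1 - Levenshtein distance / length of the longer string
text <- c(
  "This sentence is very similar to the next sentence",
  "This sentence is not very similar to the next sentence",
  "You can't sneeze with your eyes opened"
)

dist_mat <- utils::adist(text)        # pairwise Levenshtein distances
len      <- nchar(text)               # string lengths
max_len  <- outer(len, len, pmax)     # longer length for each pair
sim_mat  <- 1 - dist_mat / max_len    # 1 means identical strings

round(sim_mat[1, 2], 4)
#> [1] 0.9259
```

Rows 1 and 2 differ only by the four inserted characters of "not ", and the longer string has 54 characters, so the score is 1 - 4/54, roughly 0.93. That is why a threshold of 0.8 drops row 2 while a threshold of 0.95 keeps it.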
Upvotes: 1
Reputation: 12420
Not the most elegant solution, and slow on a large data.frame, but you can use stringdist::stringsim. It compares strings and supports several similarity measures (see the method argument). So given your data:
df <- tibble::tribble(
  ~row, ~text,
  1, "This sentence is very similar to the next sentence",
  2, "This sentence is not very similar to the next sentence",
  3, "You can't sneeze with your eyes opened"
)
stringdist::stringsim(df$text[1], df$text)
#> [1] 1.0000000 0.9259259 0.2800000
We can wrap this in a function to compare every text with all texts that came before and return a logical vector.
library(dplyr)
find_dup <- function(string, thres) {
  purrr::map_lgl(seq_along(string), function(i) {
    # Compare string i with every earlier string (none when i == 1)
    sim <- stringdist::stringsim(string[i], string[seq_len(i - 1)])
    any(sim > thres)
  })
}
Using mutate() you can check that the result is correct, and then remove the duplicated entries with filter():
df %>%
  mutate(dup = find_dup(text, 0.8))
#> # A tibble: 3 × 3
#> row text dup
#> <dbl> <chr> <lgl>
#> 1 1 This sentence is very similar to the next sentence FALSE
#> 2 2 This sentence is not very similar to the next sentence TRUE
#> 3 3 You can't sneeze with your eyes opened FALSE
df %>%
  filter(!find_dup(text, 0.8))
#> # A tibble: 2 × 2
#> row text
#> <dbl> <chr>
#> 1 1 This sentence is very similar to the next sentence
#> 2 3 You can't sneeze with your eyes opened
Created on 2022-02-07 by the reprex package (v2.0.1)
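One thing to be aware of with both find_dup() and exclude_similar(): a row is dropped when it resembles any earlier row, including one that has itself already been dropped. If you would rather compare only against rows that survive, a greedy loop is one option. The sketch below is only an illustration under my own assumptions: it uses base R's adist() with the score 1 - distance / longer length (not necessarily the number stringsim returns by default), keep_dissimilar is a hypothetical name, and it assumes non-empty, non-NA strings.

```r
# Hypothetical helper, not part of stringdist or RecordLinkage:
# keep a row only if it is not too similar to any row kept so far.
keep_dissimilar <- function(text, thres = 0.8) {
  keep <- logical(length(text))             # all FALSE to start with
  for (i in seq_along(text)) {
    kept <- text[keep]                      # rows accepted so far
    if (length(kept) == 0) {
      keep[i] <- TRUE                       # nothing to compare against yet
      next
    }
    d   <- utils::adist(text[i], kept)      # Levenshtein distances to kept rows
    sim <- 1 - d / pmax(nchar(text[i]), nchar(kept))
    keep[i] <- !any(sim > thres)
  }
  keep
}

x <- c("aaaaaaaaaa", "aaaaaaaabb", "aaaaaabbbb")
keep_dissimilar(x, thres = 0.7)
#> [1]  TRUE FALSE  TRUE
```

Here the third string is 80% similar to the second but only 60% similar to the first. Because the second string was already removed, the greedy version keeps the third one, whereas comparing against all earlier rows would drop it. Which behaviour is right depends on what "too similar" should mean for your data.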
Upvotes: 2