I have a string vector. Some strings in the vector (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out strings that have a cosine similarity of more than 30% with any other string in the vector; of the strings being compared, I wish to keep the one with more words. In other words, the result should contain only strings that have less than 30% similarity with every other string of the original vector. My aim is to filter out similar strings and keep only approximately distinct ones.
Example vector:
x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")
The result should be (assuming a less-than-30% similarity cut-off):
c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")
The above result has not been verified.
The cosine code I am using:
library(tm)
CSString_vector <- c("String One", "String Two")
# build a corpus from the string vector
corp <- tm::VCorpus(VectorSource(CSString_vector))
# term-frequency weighting, keep single-character words, drop punctuation
controlForMatrix <- list(removePunctuation = TRUE,
                         wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# cosine similarity between the two document rows
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
I am working in RStudio.
So, to rephrase what you want: You'd like to calculate the pairwise similarities for all string pairs. You would then like to use that similarity matrix to identify groups of strings that are dissimilar enough to form distinct groups. For each of these groups, you want to drop all but the longest string and return that. Did I get that right?
After some experimenting, here is my proposed solution, step by step, using the igraph package. NB: I had to adjust the threshold to 0.4 to make your example work. This is heavily based on the code you provided, but I packed it up as a function and used the tidyverse to make the code, at least to my taste, a little bit more readable.
library(tm)
library(lsa)
library(tidyverse)

get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource() %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise cosine similarity
  sim <- NULL
  for (i in 1:nrow(tfm)) {
    sim_i <- apply(
      X = tfm,
      MARGIN = 1,
      FUN = lsa::cosine,
      tfm[i, ])
    sim <- rbind(sim, sim_i)
  }
  # zero out the self-similarity on the diagonal
  diag(sim) <- 0
  # label the rows with the original strings and return
  rownames(sim) <- corpus
  return(sim)
}
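As an aside (not part of the original approach, and untested): if I read the lsa documentation right, cosine() also accepts a whole matrix and then returns the similarities between all of its columns, so the row-wise loop inside get_cos_sim() could probably be collapsed into a single call:
# possible replacement for the loop: cosine() compares matrix columns,
# so transpose the term frequency matrix to put documents in the columns
sim <- lsa::cosine(t(tfm))
diag(sim) <- 0  # still zero out the self-similarity afterwards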
Now we apply this function to your example data:
# example corpus
strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food")
# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold of 0.4 to make your example work)
sim <- sim > .4
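If you want to check which pairs ended up above the threshold before building the graph, a quick inspection (my addition, not needed for the rest) could be:
# row/column index pairs flagged as similar
# (upper triangle only, so each pair is listed once)
which(sim & upper.tri(sim), arr.ind = TRUE)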
This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for the maximal cliques, i.e. groups of mutually similar strings, in the graph built from the whole similarity matrix.
library(igraph)
# create a graph from the adjacency (similarity) matrix,
# then extract the maximal cliques of mutually similar strings
cliques <- sim %>%
  dplyr::as_data_frame() %>%
  mutate(from = row_number()) %>%
  gather(key = 'to', value = 'edge', -from) %>%
  filter(edge == TRUE) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()
Now we can use the list of cliques to retrieve the strings for each of the vertices and pick the longest string per clique. Caveat: strings that have no similar strings in the corpus are missing from the graph, so I am adding them back in manually. There might be a function in the igraph package that is better at dealing with this; I would be interested if anyone finds something.
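One missing piece: the next chunk calls a helper find_longest() that is not defined anywhere in this post. A minimal version, matching how it is called below (one vector of string indices per clique, plus the original string vector) and consistent with the output shown further down, could be:
# assumed helper (not part of the original code): given a vector of indices
# into `strings`, return the single longest string by character count
find_longest <- function(index, strings) {
  candidates <- strings[index]
  candidates[which.max(nchar(candidates))]
}
If you would rather follow the question literally and compare word counts instead of character counts, swap nchar() for a word count.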
# get the string indices per vertex clique first
string_cliques_index <- cliques %>%
  unlist() %>%
  names() %>%
  as.numeric()
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric()
# get a list with all indices: one element per clique, plus the unconnected ones
all_distinct <- cliques %>%
  lapply(names) %>%
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings (longest string per clique)
lapply(all_distinct, find_longest, strings)
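On the open question above about letting igraph deal with the unconnected strings: one possibility (an untested sketch on my part, not what the code above does) is to build the graph directly from the binarized matrix with graph_from_adjacency_matrix(), which keeps isolated vertices, so they should come back from max_cliques() as cliques of size one:
# build the graph straight from the binarized matrix; `* 1` turns the
# logical matrix into 0/1 and `diag = FALSE` drops self-loops
adj <- sim * 1
rownames(adj) <- colnames(adj) <- as.character(seq_len(nrow(adj)))  # vertex names = string indices
g_alt <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
cliques_alt <- max_cliques(g_alt)  # unconnected strings appear as size-one cliques
The resulting clique list could then go through the same names()/as.numeric()/find_longest() steps, without manually adding string_uniques_index back in.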
Let's test this with a longer vector of different strings:
strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")
I get this binarized similarity matrix (the columns follow the same string order as the rows):
Dan is a good man and very smart FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Hurricanes are a tragedy FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
I like to eat tasty food FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
The thief was caught by the FBI FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Based on these similarities, the expected outcome would be:
# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat
# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI
The actual output has the right elements, but not in the original order. You can reorder it using your original string vector, though (see the snippet after the output below).
[[1]]
[1] "The FBI organisation is part of the government of USA"
[[2]]
[1] "Dan is a good man and very smart"
[[3]]
[1] "Alex can be trusted with anything"
[[4]]
[1] "Dan likes to share his food"
[[5]]
[1] "Mangoes are very tasty to eat "
[[6]]
[1] "NASA is a government organisation"
[[7]]
[1] "Hurricanes are a tragedy"
That's all! I hope this is what you were looking for and that it may be useful to others.