ALEX MATHEW
ALEX MATHEW

Reputation: 261

Using Cosine similarity in String vector to filter out similar strings

I have a string vector. Some strings (could be more than two) of the vector are similar to each other in terms of the words they contain. I want to filter out strings which have a cosine similarity of more than 30% with any other string of the vector. Of the two strings being compared, I wish to keep the string with more words. That is, I want as result only those strings which have less than 30% similarity with any string of the original vector. My aim is to filter out similar strings to keep only approximately distinct strings.

Ex. Vector is :

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

Result should give (assuming a less than 30% similarity):

c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The above result has not been verified.

Cosine Code I am using : 

CSString_vector <- c("String One","String Two")
    corp <- tm::VCorpus(VectorSource(CSString_vector))
    controlForMatrix <- list(removePunctuation = TRUE,wordLengths = c(1, Inf),
    weighting = weightTf)
    dtm <- DocumentTermMatrix(corp,control = controlForMatrix)
    matrix_of_vector = as.matrix(dtm)
    res <- lsa::cosine(matrix_of_vector[1,], matrix_of_vector[2,])

I am working in RStudio.

Upvotes: 5

Views: 1783

Answers (1)

JanLauGe
JanLauGe

Reputation: 2335

So, to rephrase what you want: You'd like to calculate the pairwise similarities for all string pairs. You would then like to use that similarity matrix to identify groups of strings that are dissimilar enough to form distinct groups. For each of these groups, you want to drop all but the longest string and return that. Did I get that right?

After some experimenting, here is my proposed solution, step by step:

  • calculate the similarity matrix and binarize it using the threshold value
  • identify distinct groups (cliques) using a graph algorithm from the igraph package
  • find all strings in each clique and retain the longest string

NB: I had to adjust the threshold to 0.4 to make your example work.


Similarity Matrix

This is heavily based on the code you provided, but I packed it up as a function and used the tidyverse to make the code, at least to my taste, a little bit more readable.

library(tm)
library(lsa)
library(tidyverse)

get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = corpus %>% list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for(i in 1:nrow(tfm)) {
    sim_i <- apply(
      X = tfm, 
      MARGIN = 1, 
      FUN = lsa::cosine, 
      tfm[i,])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label and return
  rownames(sim) <- corpus
  return(sim)
}

Now we apply this function to your example data

# example corpus
strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food")

# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4  

Identify Distinct Groups

This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, that led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups in the graph of the whole similarity matrix

library(igraph)

# create graph from adjacency matrix
cliques <- sim %>% 
  dplyr::as_data_frame() %>%
  mutate(from = row_number()) %>% 
  gather(key = 'to', value = 'edge', -from) %>% 
  filter(edge == T) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()

Find Longest String

Now we can use the list of cliques to retrieve the strings for each of the vertices and pick the longest string per clique. Caveat: strings that have no similar strings in the corpus are missing from the graph. I am adding them back in manually. There might be a function in the igraph package that's better at dealing with it, would be interested if anyone finds something

# get the string indices per vertex clique first
string_cliques_index <- cliques %>% 
  unlist %>%
  names %>%
  as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric
# get a list with all indices
all_distict <- cliques %>% 
  lapply(names) %>% 
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings
lapply(all_distict, find_longest, strings)  

Test case:

Let's test this with a longer vector of different strings:

strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")

I get this binarized similarity matrix:

Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Based on these similarities, the expected outcome would be:

# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat

# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI

The actual output has the right elements, but not in the original order. You can reorder using your original string vector though

[[1]]
[1] "The FBI organisation is part of the government of USA"

[[2]]
[1] "Dan is a good man and very smart"

[[3]]
[1] "Alex can be trusted with anything"

[[4]]
[1] "Dan likes to share his food"

[[5]]
[1] "Mangoes are very tasty to eat "

[[6]]
[1] "NASA is a government organisation"

[[7]]
[1] "Hurricanes are a tragedy"

That's all! Hope it is what you were looking for and may be useful to others.

Upvotes: 2

Related Questions