Ojaswita

Reputation: 93

Finding the cosine similarity of a sentence with many others in R

I would like to use R to find the cosine similarity of one sentence with many others. For example:

s1 <- "The book is on the table"  
s2 <- "The pen is on the table"  
s3 <- "Put the pen on the book"  
s4 <- "Take the book and pen"  

sn <- "Take the book and pen from the table"  

I want to find the cosine similarity of s1, s2, s3 and s4 with sn. I understand that I have to use vectors (convert the sentences into vectors and use TF-IDF and/or dot product) but since I'm relatively new to R, I'm having a problem implementing it.

I would appreciate any help.

Upvotes: 3

Views: 1838

Answers (2)

AkselA

Reputation: 8837

The cosine dissimilarity used by stringdist is not based on words or terms but on q-grams: sequences of q characters, which may or may not form words. We can see intuitively that something is off with the output given in Rui's answer. The only difference between the first two sentences is "pen" versus "book", and the last sentence contains each of these words once, so we would expect the s1-sn and s2-sn dissimilarities to be identical, which they aren't.
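To make the q-gram point concrete, here is a minimal base R sketch of a character-level cosine distance. The helper names qgrams and qgram_cosine_dist are made up for this illustration and are not part of stringdist; the sketch assumes q = 1 (single characters, case-sensitive, spaces included), which I believe is stringdist's default for this method. Worked by hand, it gives the same first value as Rui's output.

```r
s1 <- "The book is on the table"
sn <- "Take the book and pen from the table"

# Hypothetical helper: all overlapping runs of q characters in a string
qgrams <- function(x, q = 1) {
  n <- nchar(x)
  if (n < q) character(0) else substring(x, seq_len(n - q + 1), q:n)
}

# Hypothetical helper: 1 - cosine similarity of the two q-gram count vectors
qgram_cosine_dist <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  vocab <- union(ga, gb)
  # Count how often each q-gram of the shared vocabulary occurs in each string
  va <- tabulate(match(ga, vocab), nbins = length(vocab))
  vb <- tabulate(match(gb, vocab), nbins = length(vocab))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

d <- qgram_cosine_dist(sn, s1)
d
# approximately 0.0648, i.e. a character-count distance, not a word-level one
```

Because the vectors count characters rather than words, swapping "book" for "pen" changes the character profile, which is why s1 and s2 end up at different distances from sn.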
There are probably other R libraries that can compute more conventional cosine similarities, but it's also not too hard to do it ourselves from first principles, and it might end up being more educational.

sv <- c(s1=s1, s2=s2, s3=s3, s4=s4, sn=sn)

# Split sentences into words
svs <- strsplit(tolower(sv), "\\s+")

# Calculate term frequency tables (tf)
termf <- table(stack(svs))

# Calculate inverse document frequencies (idf)
idf <- log(1/rowMeans(termf != 0))

# Multiply to get tf-idf
tfidf <- termf*idf

# Calculate dot products between the last tf-idf and all the previous
dp <- t(tfidf[,5]) %*% tfidf[,-5]

# Divide by the product of the Euclidean norms to get the cosine similarity
cosim <- dp/(sqrt(colSums(tfidf[,-5]^2))*sqrt(sum(tfidf[,5]^2)))
cosim
#           [,1]      [,2]       [,3]      [,4]
# [1,] 0.1215616 0.1215616 0.02694245 0.6198245
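The steps above can also be wrapped into a small reusable function. This is only a sketch: the name cosine_to_target and its arguments are made up here, and it builds the term-frequency matrix with tabulate instead of table(stack(...)), but the arithmetic is meant to be the same word-level TF-IDF cosine.

```r
# Hypothetical helper: word-level TF-IDF cosine similarity of each sentence
# in `sentences` with a single `target` sentence
cosine_to_target <- function(sentences, target) {
  docs <- strsplit(tolower(c(sentences, target)), "\\s+")
  vocab <- sort(unique(unlist(docs)))
  # Term-frequency matrix: one row per term, one column per sentence
  tf <- vapply(docs,
               function(w) tabulate(match(w, vocab), nbins = length(vocab)),
               integer(length(vocab)))
  # Inverse document frequency; terms found in every sentence get weight 0
  idf <- log(ncol(tf) / rowSums(tf != 0))
  tfidf <- tf * idf
  k <- ncol(tfidf)                       # the target is the last column
  others <- tfidf[, -k, drop = FALSE]
  num <- colSums(others * tfidf[, k])    # dot products with the target
  den <- sqrt(colSums(others^2)) * sqrt(sum(tfidf[, k]^2))
  out <- num / den
  names(out) <- names(sentences)
  out
}

res <- cosine_to_target(
  c(s1 = "The book is on the table",
    s2 = "The pen is on the table",
    s3 = "Put the pen on the book",
    s4 = "Take the book and pen"),
  "Take the book and pen from the table"
)
res
```

Note that a sentence made up only of words that occur in every sentence has an all-zero TF-IDF vector, so its similarity comes out as NaN; a real implementation would need to decide how to handle that case.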

Upvotes: 4

Rui Barradas

Reputation: 76402

The best way to do what the question asks for is to use package stringdist.

library(stringdist)

stringdist(sn, c(s1, s2, s3, s4), method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841

When the strings' names follow an obvious pattern, as they do in the question, mget can be of use: there is no need to hard-code the names one by one in the call to stringdist.

s_vec <- unlist(mget(ls(pattern = "^s\\d+")))
stringdist(sn, s_vec, method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841
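One caveat with the name-pattern approach, sketched below in base R only: the pattern "^s\\d+" is not anchored at the end, and ls() returns names in alphabetical order, so "s10" would come before "s2". The environment env and the extra variable s10 are made up for this illustration so that the example does not depend on what is in the workspace.

```r
# Hypothetical isolated environment standing in for the workspace
env <- new.env()
assign("s1", "The book is on the table", envir = env)
assign("s2", "The pen is on the table", envir = env)
assign("s10", "Put the pen on the book", envir = env)
assign("sn", "Take the book and pen from the table", envir = env)

# Anchoring with $ restricts the match to "s" followed only by digits
nms <- ls(env, pattern = "^s\\d+$")

# Reorder by the numeric suffix instead of alphabetically
nms <- nms[order(as.integer(sub("^s", "", nms)))]

s_vec <- unlist(mget(nms, envir = env))
names(s_vec)
```

The resulting s_vec can then be passed to stringdist in place of the hard-coded vector, with its elements in numeric order.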

Upvotes: 2
