Reputation: 93
I would like to use R to find the cosine similarity of one sentence with many others. For example:
s1 <- "The book is on the table"
s2 <- "The pen is on the table"
s3 <- "Put the pen on the book"
s4 <- "Take the book and pen"
sn <- "Take the book and pen from the table"
I want to find the cosine similarity of s1, s2, s3 and s4 with sn. I understand that I have to use vectors (convert the sentences into vectors and use TF-IDF and/or a dot product), but since I'm relatively new to R, I'm having trouble implementing it.
I would appreciate any help.
Upvotes: 3
Views: 1838
Reputation: 8837
The cosine dissimilarity used by stringdist isn't based on words or terms but on q-grams, which are sequences of q characters that may or may not form words. We can see intuitively that something is wrong with the output given in Rui's answer: the only difference between the first two sentences is pen versus book, and the last sentence contains each of these words once, so we'd expect the s1–sn and s2–sn dissimilarities to be identical, which they aren't.
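To make the q-gram point concrete, here is a minimal base-R sketch that computes a character-level cosine similarity by hand. It assumes stringdist's default of q = 1 (single characters, case-sensitive, spaces included); qgrams and cosine_sim are helper names made up for this illustration, not functions from any package.

```r
s1 <- "The book is on the table"
s2 <- "The pen is on the table"
sn <- "Take the book and pen from the table"

# Hypothetical helper: all overlapping character q-grams of a string
qgrams <- function(s, q = 1) {
  starts <- seq_len(max(0, nchar(s) - q + 1))
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: cosine similarity between two bags of q-grams
cosine_sim <- function(x, y) {
  keys <- union(x, y)
  a <- as.numeric(table(factor(x, levels = keys)))  # counts of x over all keys
  b <- as.numeric(table(factor(y, levels = keys)))  # counts of y over all keys
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

sim_s1 <- cosine_sim(qgrams(s1), qgrams(sn))  # about 0.9352
sim_s2 <- cosine_sim(qgrams(s2), qgrams(sn))  # about 0.9198
1 - c(sim_s1, sim_s2)                         # about 0.0648 and 0.0802
```

The two dissimilarities agree with the first two values in Rui's output, and they differ only because "book" and "pen" contribute different letters, not because of anything at the word level.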
There are probably other R libraries that can compute more conventional cosine similarities, but it's not too hard to do it ourselves, from first principles, and doing so may be more educational.
sv <- c(s1=s1, s2=s2, s3=s3, s4=s4, sn=sn)
# Split sentences into words
svs <- strsplit(tolower(sv), "\\s+")
# Calculate term frequency tables (tf)
termf <- table(stack(svs))
# Calculate inverse document frequencies (idf)
idf <- log(1/rowMeans(termf != 0))
# Multiply to get tf-idf
tfidf <- termf*idf
# Calculate dot products between the last tf-idf and all the previous
dp <- t(tfidf[,5]) %*% tfidf[,-5]
# Divide by the product of the Euclidean norms to get the cosine similarity
cosim <- dp/(sqrt(colSums(tfidf[,-5]^2))*sqrt(sum(tfidf[,5]^2)))
cosim
# [,1] [,2] [,3] [,4]
# [1,] 0.1215616 0.1215616 0.02694245 0.6198245
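If this is needed for more than one set of sentences, the same steps can be wrapped in a function. The following is only a sketch under a few assumptions: tfidf_cosine is a hypothetical name, tokenization is plain lower-casing and splitting on whitespace (punctuation is not stripped), and the reference sentence is counted as one of the documents when computing idf, as in the code above.

```r
s1 <- "The book is on the table"
s2 <- "The pen is on the table"
s3 <- "Put the pen on the book"
s4 <- "Take the book and pen"
sn <- "Take the book and pen from the table"

# Hypothetical helper: tf-idf cosine similarity of each sentence to a reference
tfidf_cosine <- function(sentences, reference) {
  docs <- c(sentences, reference)
  # Naive tokenization: lower-case, then split on whitespace
  words <- strsplit(tolower(docs), "\\s+")
  doc_id <- factor(rep(seq_along(words), lengths(words)),
                   levels = seq_along(words))
  # Term-by-document count matrix; the reference is the last column
  termf <- unclass(table(unlist(words), doc_id))
  # idf = log(number of documents / number of documents containing the term)
  idf <- log(ncol(termf) / rowSums(termf > 0))
  tfidf <- termf * idf  # idf is recycled down each column, i.e. once per term
  k <- ncol(tfidf)
  ref <- tfidf[, k]
  others <- tfidf[, -k, drop = FALSE]
  # Dot products divided by the product of the Euclidean norms
  sims <- drop(crossprod(others, ref)) /
    (sqrt(colSums(others^2)) * sqrt(sum(ref^2)))
  names(sims) <- names(sentences)
  sims
}

res <- tfidf_cosine(c(s1 = s1, s2 = s2, s3 = s3, s4 = s4), sn)
round(res, 4)
#     s1     s2     s3     s4
# 0.1216 0.1216 0.0269 0.6198
```

Building the document index with an explicit factor, rather than relying on stack, keeps the reference in the last column regardless of how the sentences are named.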
Upvotes: 4
Reputation: 76402
The best way to do what the question asks is to use the stringdist package.
library(stringdist)
stringdist(sn, c(s1, s2, s3, s4), method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841
When the strings' names follow an obvious pattern, as in the question, mget can be useful: there is no need to hard-code the names one by one in the call to stringdist.
s_vec <- unlist(mget(ls(pattern = "^s\\d+")))
stringdist(sn, s_vec, method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841
Upvotes: 2