Alvi

Reputation: 123

Calculate cosine and Jaccard similarity of a collection of documents in R

I need to calculate the similarity between almost 14 thousand documents, but my code is taking too long to run. Is there a faster way to do the same work?

Here is my code:

library(openxlsx) #for createWorkbook, addWorksheet, writeData, saveWorkbook
library(tm)       #for Corpus, tm_map, DocumentTermMatrix
library(proxy)    #for dist() with "cosine" and "jaccard" methods

wb=createWorkbook() #create workbook
addWorksheet(wb,"absSim") #create worksheet
listoffiles=list.files() #get list of documents from current working directory
fileslength=length(listoffiles) #no of documents in directory
rownum=1 #worksheet row where the next result will be written
for(i in 1:(fileslength-1))
{
  d1=readLines(listoffiles[i]) #read first document
  k=i+1
  for(j in k:fileslength)
  {
   d2=readLines(listoffiles[j]) #read second document
   #make a vector of two documents
   myvector=c(d1,d2)
   #making corpus of two documents
   mycorpus=Corpus(VectorSource(myvector))
   #preprocessing of corpus
   mycorpus=tm_map(mycorpus,removePunctuation)
   mycorpus=tm_map(mycorpus,removeNumbers)
   mycorpus=tm_map(mycorpus,stripWhitespace)
   mycorpus=tm_map(mycorpus,tolower)
   mycorpus=tm_map(mycorpus,function(x) removeWords(x,stopwords("english")))
   mycorpus=tm_map(mycorpus,function(x) removeWords(x,"x"))
   #make a document term matrix now
   dtm=as.matrix(DocumentTermMatrix(mycorpus))
   #compute distance of both documents using proxy package
   cdist=as.matrix(dist(dtm,method = "cosine"))
   jdist=as.matrix(dist(dtm,method = "jaccard"))
   #compute similarity
   csim=1-cdist
   jsim=1-jdist
   #get similarity of both documents
   cos=csim[1,2]
   jac=jsim[1,2]
   if(cos>0 | jac>0)
   {
     writeData(wb,"absSim",cos,startCol = 1,startRow = rownum)
     writeData(wb,"absSim",jac,startCol = 2,startRow = rownum)
     saveWorkbook(wb,"abstractSimilarity.xlsx",overwrite = TRUE)
     rownum=rownum+1
   }
  }
}

When I run this code, comparing just the first document against all the others takes about 2 hours. Is there any way to calculate cosine and Jaccard similarity faster?

Upvotes: 0

Views: 1022

Answers (1)

Manuel Bickel

Reputation: 2206

You might try the following code. It is a very simplified version without any cleaning or pruning, just to demonstrate how to use text2vec. I have also used the tokenizers package for tokenization, since it's a bit faster than the tokenizer in text2vec. The sampling function used to generate example documents was provided by Zach for this question/answer. On my machine the whole thing completes in less than a minute. Of course, other similarity measures or integration of pre-processing are possible. I hope this is what you are looking for.

library(text2vec)
library(tokenizers)

samplefun <- function(n, x, collapse){
  paste(sample(x, n, replace=TRUE), collapse=collapse)
}

words <- sapply(rpois(10000, 8) + 1, samplefun, letters, '')

#14000 documents, each with 100 lines (pasted together) of several words
docs <- sapply(1:14000, function(x) {

  paste(sapply(rpois(100, 5) + 1, samplefun, words, ' '), collapse = ". ")

})

iterator <- itoken(docs
                   ,tokenizer = function(x) tokenizers::tokenize_words(x, lowercase = FALSE)
                   ,progressbar = FALSE
                   )

vocabulary <- create_vocabulary(iterator)

dtm <- create_dtm(iterator, vocab_vectorizer(vocabulary))

#dtm
#14000 x 10000 sparse Matrix of class "dgCMatrix"
#....

#use, e.g., the first and second half of the dtm as document sets
similarity <- sim2(dtm[1:(nrow(dtm)/2),]
                   , dtm[(nrow(dtm)/2+1):nrow(dtm),]
                   , method = "jaccard"
                   , norm = "none")

dim(similarity)
#[1] 7000 7000
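
If you also want cosine similarity or some basic pruning, here is a minimal sketch along the same lines, reusing the iterator and dtm from above (the term_count_min = 2 value is just an arbitrary example, not a recommendation):

#optionally prune rare terms before building the dtm
vocabulary_pruned <- prune_vocabulary(vocabulary, term_count_min = 2)
dtm_pruned <- create_dtm(iterator, vocab_vectorizer(vocabulary_pruned))

#cosine similarity between the two halves of the dtm,
#this time with L2 normalization of the document vectors
cos_similarity <- sim2(dtm[1:(nrow(dtm)/2),]
                       , dtm[(nrow(dtm)/2+1):nrow(dtm),]
                       , method = "cosine"
                       , norm = "l2")

dim(cos_similarity)
#[1] 7000 7000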

Upvotes: 1
