Reputation: 123
I am going to calculate the similarity between almost 14 thousand documents, but the code is taking too much time to execute. Is there any other way to do the same work faster?
Here is my code:
library(openxlsx) #for createWorkbook, writeData, saveWorkbook
library(tm) #for Corpus, tm_map, DocumentTermMatrix
library(proxy) #for dist() with "cosine" and "jaccard" methods
wb=createWorkbook() #create workbook
addWorksheet(wb,"absSim") #create worksheet
listoffiles=list.files() #get list of documents from current working directory
fileslength=length(listoffiles) #no of documents in directory
rownum=1 #next worksheet row to write results to
for(i in 1:(fileslength-1)) #parentheses needed: 1:fileslength-1 would start at 0
{
d1=readLines(listoffiles[i])# read first document
k=i+1
for(j in k:fileslength)
{
d2=readLines(listoffiles[j]) #read second document
#make a vector of two documents
myvector=c(d1,d2)
#making corpus of two documents
mycorpus=Corpus(VectorSource(myvector))
#preprocessing of corpus
mycorpus=tm_map(mycorpus,removePunctuation)
mycorpus=tm_map(mycorpus,removeNumbers)
mycorpus=tm_map(mycorpus,stripWhitespace)
mycorpus=tm_map(mycorpus,content_transformer(tolower)) #wrap tolower so the result stays a valid tm corpus
mycorpus=tm_map(mycorpus,function(x) removeWords(x,stopwords("english")))
mycorpus=tm_map(mycorpus,function(x) removeWords(x,"x"))
#make a document term matrix now
dtm=as.matrix(DocumentTermMatrix(mycorpus))
#compute distance of both documents using proxy package
cdist=as.matrix(dist(dtm,method = "cosine"))
jdist=as.matrix(dist(dtm,method = "jaccard"))
#compute similarity
csim=1-cdist
jsim=1-jdist
#get similarity of both documents
cos=csim[1,2]
jac=jsim[1,2]
if(cos>0 | jac>0)
{
writeData(wb,"absSim",cos,startCol = 1,startRow = rownum)
writeData(wb,"absSim",jac,startCol = 2,startRow = rownum)
saveWorkbook(wb,"abstractSimilarity.xlsx",overwrite = TRUE)
rownum=rownum+1
}
}
}
When I run this code, processing just the first document against all the others took about 2 hours. Is there any way to calculate the cosine and jaccard similarity faster?
Upvotes: 0
Views: 1022
Reputation: 2206
You might try the following code. It is a very simplified version, without any cleaning or pruning, just to demonstrate how to use text2vec. I have also used the tokenizers package for tokenization, since it's a bit faster than the tokenizer in text2vec. I used the sampling function that was provided by Zach for this question/answer. On my machine it completes in less than a minute. Of course, other similarity measures or integration of pre-processing are possible. I hope this is what you are looking for.
library(text2vec)
library(tokenizers)
samplefun <- function(n, x, collapse){
paste(sample(x, n, replace=TRUE), collapse=collapse)
}
words <- sapply(rpois(10000, 8) + 1, samplefun, letters, '')
#14000 documents, each with 100 lines (pasted together) of several words
docs <- sapply(1:14000, function(x) {
paste(sapply(rpois(100, 5) + 1, samplefun, words, ' '), collapse = ". ")
})
iterator <- itoken(docs
,tokenizer = function(x) tokenizers::tokenize_words(x, lowercase = FALSE)
,progressbar = FALSE
)
vocabulary <- create_vocabulary(iterator)
dtm <- create_dtm(iterator, vocab_vectorizer(vocabulary))
#dtm
#14000 x 10000 sparse Matrix of class "dgCMatrix"
#....
#use, e.g., the first and second half of the dtm as document sets
similarity <- sim2(dtm[1:(nrow(dtm)/2),]
, dtm[(nrow(dtm)/2+1):nrow(dtm),]
, method = "jaccard"
, norm = "none")
dim(similarity)
#[1] 7000 7000
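For the cosine measure asked about in the question, the same approach should work. Here is a minimal sketch reusing the dtm built above (the cos_similarity name is just for illustration), calling sim2 with method = "cosine" and L2 normalisation:
#cosine similarity between the same two halves of the dtm (illustrative sketch)
cos_similarity <- sim2(dtm[1:(nrow(dtm)/2),]
                       , dtm[(nrow(dtm)/2+1):nrow(dtm),]
                       , method = "cosine"
                       , norm = "l2")
dim(cos_similarity)
#should again be 7000 x 7000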
Upvotes: 1