Fast calculation of the highest similarity score for multi-million-document corpus

Question

I need to find the highest similarity score of a document with all the documents prior to the generation of the document.

I plan to use the quanteda package in R and come up with the following code. dfm is the dfm matrix, which has more than 3 million documents and 4 million features. In each iteration, I compare the target document dfm[id_i,] with all the documents prior to the target document dfm_subset(dfm,date. The resultant similarity score vecotr is stored in one_simil. I can obtain the highest similarity score from max(one_simil,na.rm=T).


Normally,  dfm_subset(dfm,date has more than 1 million documents, so the computation of  one_simil is quite expensive, taking around 1 minute to finish. Since I need to get the highest similarity for around 1 million documents, the total computation time is just too long (about 2,000,000 mins).

I wonder is there any way to speed up the calculation? My thought is that I'm only interested in the highest similarity score, so I do not need to compare dfm[id_i,] with every document in  dfm_subset(dfm,date, therefore, there should exist room for improvement. But I don't know how. Any suggestion is welcomed!

similarity_res=vector("list",nrow(to_find_docs)) #store the result
for(row_i in 1:nrow(to_find_docs)){
  id_i=to_find_docs$id[row_i]
  date_i=to_find_docs$date[row_i]
  
  
  one_simil= textstat_simil(
    dfm_subset(dfm,date

Fast calculation of the highest similarity score for multi-million-document corpus

Answers (1)

Related Questions