Slow loop in R: how can I make it faster?

Question

I have a list of e-mails and I would like to find similar rows by comparing them pairwise with the longest common substring (LCS) distance.

data is a data frame with e-mails:

           V1
1   "01003@163.com"
2   "cloud@coldmail.com"
3   "den_smukk_kiilar@hotmail.com"
4   "Esteban.verduzco@gmail.com"
5   "freiheitmensch@gmail.com"
6   "mitsoanastos@yahoo.com"
7   "ahmedsir744@yahoo.com" 
8   ...

This is my code:

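library(stringdist)

# result table and row counter (both must exist before the loop runs)
output3 <- data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE)
dfrow <- 1

for (i in seq_len(nrow(data) - 1)) {   # stop at n - 1 so that i + 1 is always a valid row
  sample <- as.character(data[i, ])
  for (j in (i + 1):nrow(data)) {
    candidate <- as.character(data[j, ])
    # threshold: distance of at most 3 (intended to treat e.g. 123.456 and 123.321 as similar)
    if (stringdist(candidate, sample, method = "lcs") <= 3) {
      output3[dfrow, ] <- c(sample, candidate)
      dfrow <- dfrow + 1
    }
  }
}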

The output is a data frame listing the pairs of similar e-mails:

         email1          email2
1   "01079@163.com" "01069@163.com"
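
For reference, a minimal sketch of what the distance in the loop measures for the pair shown above. As far as I understand the stringdist documentation, method = "lcs" counts the characters left unpaired in both strings, so one substituted character costs 2 rather than 1. The variable names d_pair and d_three are only for illustration.

```r
library(stringdist)

# LCS distance = number of characters left unpaired across both strings
d_pair <- stringdist("01079@163.com", "01069@163.com", method = "lcs")
d_pair    # the '7' and the '6' are the only unpaired characters

# Three substituted characters leave three unpaired characters on each side
d_three <- stringdist("123.456", "123.321", method = "lcs")
d_three
```

If that reading is right, a threshold of 3 admits pairs that differ by a single substituted character but not pairs that differ by three.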

I have 300k e-mails, so this will take forever...

Is there a better way to do it?

Thanks!

Answers (1)
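
One possible direction, offered as a sketch and not benchmarked on 300k rows: drop the inner loop and let stringdist compare one e-mail against all later ones in a single vectorized call, then bind the results once at the end instead of growing a data frame row by row. The helper name find_similar and its max_dist argument are hypothetical, not from the question. The number of comparisons is still quadratic, but the per-pair work moves out of interpreted R code.

```r
library(stringdist)

# Hypothetical helper: returns all pairs (i < j) with LCS distance <= max_dist
find_similar <- function(emails, max_dist = 3) {
  emails <- as.character(emails)
  n <- length(emails)
  res <- vector("list", max(n - 1, 0))
  for (i in seq_len(max(n - 1, 0))) {
    j <- (i + 1):n
    # one vectorized call compares e-mail i with every later e-mail
    d <- stringdist(emails[i], emails[j], method = "lcs")
    hits <- j[d <= max_dist]
    if (length(hits) > 0) {
      res[[i]] <- data.frame(email1 = emails[i], email2 = emails[hits],
                             stringsAsFactors = FALSE)
    }
  }
  # drop empty slots and bind once
  res <- Filter(Negate(is.null), res)
  if (length(res) == 0) {
    return(data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, res)
}

# Small usage example built from addresses in the question
emails <- c("01079@163.com", "01069@163.com",
            "cloud@coldmail.com", "mitsoanastos@yahoo.com")
similar <- find_similar(emails)
similar
```

With the question's data this would be called as find_similar(data$V1). Because the LCS distance can never be smaller than the difference in string lengths, grouping the e-mails by nchar() first and only comparing groups whose lengths differ by at most 3 should cut the number of comparisons further, though I have not measured by how much.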