DemetriusRPaula
DemetriusRPaula

Reputation: 387

Slow loop R, how make it faster?

I have a list of e-mails and I would like to compare patterns (similarity) among the rows using longest common substring to compare them.

data is a data frame with e-mails:

           V1
1   "[email protected]"
2   "[email protected]"
3   "[email protected]"
4   "[email protected]"
5   "[email protected]"
6   "[email protected]"
7   "[email protected]" 
8   ...

This is my code:

library(stringdist)

for(i in 1:nrow(data)) {
      sample <- data[i,]
      for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
        if((stringdist(data[j,],sample,method='lcs'))<=3) {  #number of different characteres 3 (123.456 == 123.321)
          duplicate <- data[j,]
          email1 = as.character(data[i,])
          email2 = as.character(data[j,])
          pair <- cbind(email1, email2)
          output3[dfrow, ] <- pair
          dfrow <- dfrow + 1
        }
      }
    }

and the "outupt" is a data frame showing the similar e-mails.

         email1          email2
1   "[email protected]" "[email protected]"

I have 300k e-mails, this will take forever...

Is there a better way to do it?

Thanks!

Upvotes: 0

Views: 136

Answers (1)

Steven Beaupr&#233;
Steven Beaupr&#233;

Reputation: 21621

Here's an attempt:

library(stringdist)
library(stringi)
library(dplyr)
library(tidyr)

# Hypothetical data frame     
data <- data.frame(V1 = paste0(stri_rand_strings(5, 3, "[a-z]"), 
                               "@", stri_rand_strings(5, 2, "[a-z]"), ".com"), 
                   stringsAsFactors = FALSE)

Basically you create a string distance pairwise matrix, wrap it in a data frame, replace all string distances that are equal or less than 3 with the corresponding V1 value and the rest with NA. Then, you remove the now unnecessary V1 column, gather() the data in a tidy format and remove NAs.

data %>%
  data.frame(stringdistmatrix(.$V1, .$V1, useNames = TRUE, method = "lcs"), 
             row.names = NULL) %>%

#          V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 [email protected]          0         10         10         10         10
#2 [email protected]         10          0         10         10          8
#3 [email protected]         10         10          0          8          8
#4 [email protected]         10         10          8          0         10
#5 [email protected]         10          8          8         10          0

  # here you need to replace '8' by '3' for your example
  mutate_each(funs(ifelse(. <= 8 & . != 0, V1, NA)), -V1) %>% 

#          V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 [email protected]         NA       <NA>       <NA>       <NA>       <NA>
#2 [email protected]         NA       <NA>       <NA>       <NA> [email protected]
#3 [email protected]         NA       <NA>       <NA> [email protected] [email protected]
#4 [email protected]         NA       <NA> [email protected]       <NA>       <NA>
#5 [email protected]         NA [email protected] [email protected]       <NA>       <NA>

  select(-V1) %>%
  gather(email1, email2) %>%
  na.omit() %>%
  mutate(email1 = stri_replace_first(email1, fixed = ".", "@"))

Which gives:

#      email1     email2
#1 [email protected] [email protected]
#2 [email protected] [email protected]
#3 [email protected] [email protected]
#4 [email protected] [email protected]
#5 [email protected] [email protected]
#6 [email protected] [email protected]

Upvotes: 3

Related Questions