Reputation: 387
I have a list of e-mails and I would like to compare patterns (similarity) among the rows using longest common substring to compare them.
data is a data frame with e-mails:
V1
1 "[email protected]"
2 "[email protected]"
3 "[email protected]"
4 "[email protected]"
5 "[email protected]"
6 "[email protected]"
7 "[email protected]"
8 ...
This is my code:
library(stringdist)
for(i in 1:nrow(data)) {
sample <- data[i,]
for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
if((stringdist(data[j,],sample,method='lcs'))<=3) { #number of different characteres 3 (123.456 == 123.321)
duplicate <- data[j,]
email1 = as.character(data[i,])
email2 = as.character(data[j,])
pair <- cbind(email1, email2)
output3[dfrow, ] <- pair
dfrow <- dfrow + 1
}
}
}
and the "outupt" is a data frame showing the similar e-mails.
email1 email2
1 "[email protected]" "[email protected]"
I have 300k e-mails, this will take forever...
Is there a better way to do it?
Thanks!
Upvotes: 0
Views: 136
Reputation: 21621
Here's an attempt:
library(stringdist)
library(stringi)
library(dplyr)
library(tidyr)
# Hypothetical data frame
data <- data.frame(V1 = paste0(stri_rand_strings(5, 3, "[a-z]"),
"@", stri_rand_strings(5, 2, "[a-z]"), ".com"),
stringsAsFactors = FALSE)
Basically you create a string distance pairwise matrix, wrap it in a data frame, replace all string distances that are equal or less than 3 with the corresponding V1
value and the rest with NA
. Then, you remove the now unnecessary V1
column, gather()
the data in a tidy format and remove NA
s.
data %>%
data.frame(stringdistmatrix(.$V1, .$V1, useNames = TRUE, method = "lcs"),
row.names = NULL) %>%
# V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 [email protected] 0 10 10 10 10
#2 [email protected] 10 0 10 10 8
#3 [email protected] 10 10 0 8 8
#4 [email protected] 10 10 8 0 10
#5 [email protected] 10 8 8 10 0
# here you need to replace '8' by '3' for your example
mutate_each(funs(ifelse(. <= 8 & . != 0, V1, NA)), -V1) %>%
# V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 [email protected] NA <NA> <NA> <NA> <NA>
#2 [email protected] NA <NA> <NA> <NA> [email protected]
#3 [email protected] NA <NA> <NA> [email protected] [email protected]
#4 [email protected] NA <NA> [email protected] <NA> <NA>
#5 [email protected] NA [email protected] [email protected] <NA> <NA>
select(-V1) %>%
gather(email1, email2) %>%
na.omit() %>%
mutate(email1 = stri_replace_first(email1, fixed = ".", "@"))
Which gives:
# email1 email2
#1 [email protected] [email protected]
#2 [email protected] [email protected]
#3 [email protected] [email protected]
#4 [email protected] [email protected]
#5 [email protected] [email protected]
#6 [email protected] [email protected]
Upvotes: 3