Reputation: 2940
I am trying to identify likely last name from parts of name strings in various formats in R. What is the fastest way to identify the longest string match from the dataset of last names to a given name string (I'm using the wru surnames2010 dataset)?
I need the longest possibility rather than any possibility. I.e. in the example below the first string "scottcampbell" contains possible surnames "scott" and "campbell". I want to only return the longest of the possible matches, in this case only "campbell".
Reproduce example data:
library(wru)
data("surnames2010")
#filter out names under 4 characters
lnames <- surnames2010[nchar(as.character(surnames2010$surname))>3,]
testvec <- c("scottcampbell","mattbaker","tsmith","watkins","burnsmary","terri","frankrodriguez","neal")
Desired imagined function+result:
foo_longest_matches(testvec)
#Desired imagined result:
[1] "campbell" "baker" "smith" "watkins" "burns" "terri" "rodriguez" "neal")
Upvotes: 0
Views: 392
Reputation: 79318
You could use adist
. Please note that you are doing more than 1million comparisons to obtain the longest. I would prefer you use a different method. The best so far that I have in mind is
a <- adist(toupper(testvec), surnames2010$surname, counts = TRUE)
b <- attr(a, "trafos")
d <- array(grepl("S|I", b) + nchar(gsub("(.)\\1++", "1",b, perl=TRUE)), dim(a)) * 10 + a
as.character(surnames2010$surname[max.col(-d)])
[1] "CAMPBELL" "BAKER" "SMITH" "WATKINS" "BURNS" "TERRI" "RODRIGUEZ" "NEAL"
benchmark:
longest <- function(testvec,namevec){
a <- adist(testvec, namevec, counts = TRUE)
b <- attr(a, "trafos")
d <- array(grepl("S|I", b) + nchar(gsub("(.)\\1++", "1",b, perl=TRUE)), dim(a)) * 10 + a
as.character(namevec[max.col(-d)])
}
EDIT: Was able to obtain a faster method(Not necessarily the fastest)
longest2 <- function(testvec,namevec){
a <- stack(sapply(namevec,grep,testvec,value = TRUE,simplify = FALSE))
tapply(as.character(a[, 2]), a[, 1], function(x) x[which.max(nchar(x))])[testvec]
}
microbenchmark::microbenchmark(longest(testvec,lnames$surname),longest2(testvec,lnames$surname),foo_longest_matches(testvec),times = 5)
Unit: seconds
expr min lq mean median uq max neval
longest(testvec, lnames$surname) 3.316550 3.984128 5.308339 6.265192 6.396348 6.579477 5
longest2(testvec, lnames$surname) 1.817059 1.917883 2.835354 3.350068 3.538278 3.553481 5
foo_longest_matches(testvec) 10.093179 10.325489 11.610619 10.756714 10.889326 15.988384 5
Upvotes: 2
Reputation: 2085
Not sure about fastest but here is a method to test:
library(wru)
data("surnames2010")
lnames <- surnames2010[nchar(as.character(surnames2010$surname))>3,]
testvec <- c("scottcampbell","mattbaker","tsmith","watkins","burnsmary","terri","frankrodriguez","neal")
lnames$surname <- tolower(lnames$surname)
testvec <- tolower(testvec)
foo_longest_matches <- function(string_vector) {
outdf <- c()
for (name in string_vector) {
print(name)
ting <- lnames[sapply(lnames$surname, function(x) grepl(x, name)),]
# you only care about the longest, remove the next line to get all matches
ting <- ting[which.max(nchar(ting$surname)),]
outdf <- rbind(outdf, ting)
}
return(outdf)
}
get_matches <- foo_longest_matches(testvec)
get_matches
# surname p_whi p_bla p_his p_asi p_oth
# 47 campbell 0.7366 0.2047 0.02490000 0.00530000 0.02840000
# 44 baker 0.7983 0.1444 0.02280000 0.00560000 0.02890000
# 1 smith 0.7090 0.2311 0.02400000 0.00500000 0.03080000
# 240 watkins 0.6203 0.3227 0.02090000 0.00420000 0.03200000
# 155 burns 0.8026 0.1406 0.02480000 0.00590000 0.02610000
# 110133 terri 0.7453 0.1801 0.01243333 0.01243333 0.04973333
# 9 rodriguez 0.0475 0.0054 0.93770000 0.00570000 0.00360000
# 337 neal 0.6210 0.3184 0.02160000 0.00600000 0.03290000
Upvotes: 1