Fastest Way To Find Last Names From String in R

Question

I am trying to identify likely last name from parts of name strings in various formats in R. What is the fastest way to identify the longest string match from the dataset of last names to a given name string (I'm using the wru surnames2010 dataset)?

I need the longest possibility rather than any possibility. I.e. in the example below the first string "scottcampbell" contains possible surnames "scott" and "campbell". I want to only return the longest of the possible matches, in this case only "campbell".

Reproduce example data:

library(wru)
data("surnames2010")
#filter out names under 4 characters
lnames <- surnames2010[nchar(as.character(surnames2010$surname))>3,]
testvec <- c("scottcampbell","mattbaker","tsmith","watkins","burnsmary","terri","frankrodriguez","neal")

Desired imagined function+result:

foo_longest_matches(testvec)
#Desired imagined result:
[1] "campbell" "baker" "smith" "watkins" "burns" "terri" "rodriguez" "neal")

Onyambu · Accepted Answer

You could use adist. Please note that you are doing more than 1million comparisons to obtain the longest. I would prefer you use a different method. The best so far that I have in mind is

a <- adist(toupper(testvec), surnames2010$surname, counts = TRUE)
b <- attr(a, "trafos")
d <- array(grepl("S|I", b) + nchar(gsub("(.)\1++", "1",b, perl=TRUE)), dim(a)) * 10 + a
as.character(surnames2010$surname[max.col(-d)])
[1] "CAMPBELL"  "BAKER"     "SMITH"     "WATKINS"   "BURNS"     "TERRI"     "RODRIGUEZ" "NEAL"

benchmark:

longest <- function(testvec,namevec){
  a <- adist(testvec, namevec, counts = TRUE)
  b <- attr(a, "trafos")
  d <- array(grepl("S|I", b) + nchar(gsub("(.)\1++", "1",b, perl=TRUE)), dim(a)) * 10 + a
  as.character(namevec[max.col(-d)])
}

EDIT: Was able to obtain a faster method(Not necessarily the fastest)

longest2 <- function(testvec,namevec){
  a <- stack(sapply(namevec,grep,testvec,value = TRUE,simplify = FALSE))
  tapply(as.character(a[, 2]), a[, 1], function(x) x[which.max(nchar(x))])[testvec]
}


microbenchmark::microbenchmark(longest(testvec,lnames$surname),longest2(testvec,lnames$surname),foo_longest_matches(testvec),times = 5)
Unit: seconds
                              expr       min        lq      mean    median        uq       max neval
  longest(testvec, lnames$surname)  3.316550  3.984128  5.308339  6.265192  6.396348  6.579477     5
 longest2(testvec, lnames$surname)  1.817059  1.917883  2.835354  3.350068  3.538278  3.553481     5
      foo_longest_matches(testvec) 10.093179 10.325489 11.610619 10.756714 10.889326 15.988384     5

Fastest Way To Find Last Names From String in R

Answers (2)

Related Questions