Reputation: 7561
I have a string and a character vector. I would like to find all strings in the character vector that match as many characters as possible from the beginning of the string. For example:
s <- "abs"
vc <- c("ab","bb","abc","acbd","dert")
result <- c("ab","abc")
The string s should be matched exactly on its first K characters, and I want K to be as large as possible (K <= nchar(s)). Here there is no match for "abs" (grep("abs", vc) returns nothing), but for "ab" there are two matches (result <- grep("ab", vc, value = TRUE)).
Upvotes: 4
Views: 1704
Reputation: 3294
Just a note, long after the fact, that the triebeard package now exists; it's very, very efficient and user-friendly for finding longest or partial matches.
Upvotes: 0
Reputation: 48241
Another interpretation:
s <- "abs"
# Updated vc
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
st <- strsplit(s, "")[[1]]
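mtc <- sapply(strsplit(substr(vc, 1, nchar(s)), ""),
              function(i) {
                # position-wise comparison with the start of s
                m <- i == st[seq_along(i)]
                # cumprod() drops to 0 at the first mismatch, so the sum is
                # the length of the common prefix
                sum(cumprod(m))})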
vc[mtc == max(mtc)]
#[1] "ab" "abc" "abwabsabs"
# Another vector vc
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
....
vc[mtc == max(mtc)]
#[1] "absq"
Since we are considering only beginnings of strings, in the first case the longest match was "ab", even though there is "abwabsabs", which contains "abs".
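To make "only beginnings count" concrete, here is a minimal sketch that computes the length of the common prefix for a single pair of strings. The helper name common_prefix_len is hypothetical, not part of the answer above, and the sketch assumes plain single-character comparison.

```r
# Hypothetical helper: number of leading characters that a and b share
common_prefix_len <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return(0L)
  x <- strsplit(substr(a, 1, n), "")[[1]]
  y <- strsplit(substr(b, 1, n), "")[[1]]
  # cumprod() becomes 0 from the first mismatch onwards
  as.integer(sum(cumprod(x == y)))
}

common_prefix_len("abs", "abwabsabs")
#[1] 2
common_prefix_len("abs", "absq")
#[1] 3
common_prefix_len("abs", "bb")
#[1] 0
```

The later "abs" inside "abwabsabs" does not contribute, because counting stops at the first mismatch.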
Edit: Here is a "single pattern" solution, possibly it could be more concise, but here we go...
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
(auxOne <- sapply((nchar(s)-1):1, function(i) substr(s, 1, i)))
#[1] "ab" "a"
(auxTwo <- sapply(nchar(s):2, function(i) substring(s, i)))
#[1] "s" "bs"
l <- attr(regexpr(
  paste0("^((", s, ")|", paste0("(", auxOne, "(?!", auxTwo, "))", collapse = "|"), ")"),
  vc, perl = TRUE), "match.length")
vc[l == max(l)]
#[1] "ab" "abc" "abwabsabs"
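It may help to look at the pattern that the paste0() call builds. The sketch below only rebuilds and prints it for s <- "abs"; the variable name pattern is introduced here for illustration, and it assumes s has at least two characters and contains no regex metacharacters.

```r
s <- "abs"
auxOne <- sapply((nchar(s) - 1):1, function(i) substr(s, 1, i))
auxTwo <- sapply(nchar(s):2, function(i) substring(s, i))
# Each alternative is a prefix of s that is NOT followed by the rest of s
pattern <- paste0("^((", s, ")|",
                  paste0("(", auxOne, "(?!", auxTwo, "))", collapse = "|"),
                  ")")
pattern
#[1] "^((abs)|(ab(?!s))|(a(?!bs)))"
```

The match.length attribute returned by regexpr() is then the number of leading characters matched, or -1 where nothing matched.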
Upvotes: 2
Reputation: 19454
Here's a function that uses grep and checks whether a given string s matches the beginning of any string in vc, repeatedly removing a character from the end of s until a match is found:
myfun <- function(s, vc) {
  notDone <- TRUE
  ans <- character(0)          # returned as-is if the loop never runs
  maxChar <- max(nchar(vc))    # EDIT: these two lines truncate s to
  s <- substr(s, 1, maxChar)   # the maximum number of chars in vc
  subN <- nchar(s)
  while (notDone && subN > 0) {
    ss <- substr(s, 1, subN)
    # ss is used as a regular expression, anchored at the start of the string
    ans <- grep(sprintf("^%s", ss), vc, value = TRUE)
    if (length(ans)) {
      notDone <- FALSE
    } else {
      subN <- subN - 1
    }
  }
  return(ans)
}
s <- "abs"
# Updated vc from @Julius's answer
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
> myfun(s, vc)
[1] "absq"
# And there's no infinite loop if there's no match
> myfun("q", "a")
character(0)
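If s may contain regex metacharacters such as "." or "(", the same shrinking-prefix loop could be written with startsWith() (base R 3.3.0 or later), which compares literally. This is a sketch of that variant, not the function above; the name prefix_matches is hypothetical.

```r
# Hypothetical variant: literal prefix comparison instead of a regex
prefix_matches <- function(s, vc) {
  # try the longest prefix of s first, then shorter ones
  for (k in rev(seq_len(nchar(s)))) {
    hit <- startsWith(vc, substr(s, 1, k))
    if (any(hit)) return(vc[hit])
  }
  character(0)  # no prefix of s starts any element of vc
}

prefix_matches("abs", c("ab", "bb", "abc", "acbd", "dert"))
#[1] "ab"  "abc"
prefix_matches("abs", c("ab", "bb", "abc", "acbd", "dert", "absq", "abab"))
#[1] "absq"
prefix_matches("q", "a")
#character(0)
```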
Upvotes: 1