Reputation: 5925
I am looking for a function (let's call it scramblematch) that can do the following.
query='one five six'
target1='one six two six three four five '
target2=' two five six'
scramblematch(query, target1)
returns TRUE
and
scramblematch(query, target2)
returns FALSE
The stringdist package might be what I need, but I don't know how to use it.
Update1
Use case for the function I am looking for: I have a dataset whose data was entered gradually over the years. The values of one text field (textfield) were never standardized, so people entered them in different ways. Now I want to clean up this data by mapping textfield onto a standardized set of values. All values that describe the same thing in different wordings are to be replaced by the standardized value. For example (I am making this up):
In my standardized set of values (let's call this lookupfactors), I have lookupfactors=c('liver disease', 'and more').
In textfield I have the following rows:
liver cancer disease
some other thing
male, liver fibrosis disease
yet another thing
failure of liver, disease
In the final result, I want rows 1, 3, and 5 (because they contain 'liver' and 'disease') to be replaced by liver disease. Here I assume that the people who entered the data did not know the precise term, but knew the keywords to include. Therefore the words in each value of lookupfactors are a subset of the words in the matching textfield entry.
Upvotes: 1
Views: 485
Reputation: 70336
One option is to implement it with %in% and strsplit:
scramblematch <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
scramblematch(query, target1)
#[1] TRUE
scramblematch(query, target2)
#[1] FALSE
A vectorized approach using stringi could be:
library(stringi)
scramblematch <- function(query, target, sep = " ") {
  q <- stri_split_fixed(query, sep)[[1L]]
  sapply(stri_split_fixed(target, sep), function(x) {
    all(q %in% x)
  })
}
scramblematch(query, c(target1, target2))
#[1] TRUE FALSE
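For the use case in the update, the same idea could be applied to a whole column. The following is only a minimal sketch in base R, not a tested solution for real data: it assumes that words should be split on any run of non-alphanumeric characters (so that "liver," still counts as "liver"), and the names scramblematch_words and cleaned are made up for illustration.

```r
# Hypothetical helper: for each element of `target`, TRUE if it contains
# every word of `query` as a whole word, in any order.
scramblematch_words <- function(query, target) {
  # Assumption: words are separated by runs of non-alphanumeric characters
  q <- strsplit(query, "[^[:alnum:]]+")[[1L]]
  q <- q[nzchar(q)]
  vapply(strsplit(target, "[^[:alnum:]]+"),
         function(words) all(q %in% words),
         logical(1L))
}

# Example data taken from the question
textfield <- c("liver cancer disease",
               "some other thing",
               "male, liver fibrosis disease",
               "yet another thing",
               "failure of liver, disease")
lookupfactors <- c("liver disease", "and more")

# Replace every row that contains all words of a standardized value
cleaned <- textfield
for (f in lookupfactors) {
  cleaned[scramblematch_words(f, textfield)] <- f
}
cleaned
#[1] "liver disease"     "some other thing"  "liver disease"
#[4] "yet another thing" "liver disease"
```

If a row matches more than one standardized value, the later entry in lookupfactors wins in this sketch, so the order of lookupfactors matters.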
Upvotes: 3
Reputation: 24520
You can try (the fixed=TRUE improvement is from @David's comment):
scramblematch <- function(query, target) {
  Reduce("&", lapply(strsplit(query, " ")[[1]], grepl, target, fixed = TRUE))
}
Some benchmark:
query='one five six'
target1='one six two six three four five '
target2=' two five six'
target<-rep(c(target1,target2),10000)
system.time(scramblematch(query,target))
# user system elapsed
#0.008 0.000 0.008
scramblematchDD <- function(query, target, sep = " ") {
all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
system.time(vapply(target,scramblematchDD,query=query,TRUE))
# user system elapsed
#0.657 0.000 0.658
The vapply call is needed for the first @docendodiscimus solution (scramblematchDD above), since that function is not vectorized over target.
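One thing to keep in mind: grepl with fixed=TRUE matches substrings, so a query word such as "six" would also be found inside "sixty". If whole-word matching is wanted, a possible variant is sketched below. It assumes the query words contain only letters and digits (no regex metacharacters), and the names scramblematch_sub and scramblematch_word are made up for this comparison.

```r
query   <- 'one five six'
target1 <- 'one six two six three four five '
target2 <- ' two five six'

# Substring version, same logic as the function above
scramblematch_sub <- function(query, target) {
  Reduce("&", lapply(strsplit(query, " ")[[1]], grepl, target, fixed = TRUE))
}

# Whole-word sketch: wrap each query word in \b word boundaries.
# Assumption: query words are plain alphanumeric, so no escaping is done.
scramblematch_word <- function(query, target) {
  words <- strsplit(query, " ", fixed = TRUE)[[1]]
  Reduce("&", lapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), target, perl = TRUE)
  }))
}

scramblematch_sub("six", "sixty one")
#[1] TRUE
scramblematch_word("six", "sixty one")
#[1] FALSE
scramblematch_word(query, c(target1, target2))
#[1] TRUE FALSE
```

The whole-word version gives up the speed advantage of fixed=TRUE, so it is only worth using when partial matches would be wrong for the data.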
Upvotes: 3