biocyberman

Reputation: 5925

R: tell if a substring exists in another string

I am looking for a function (let's call it scramblematch) that can do the following.

query='one five six'
target1='one six two six three four five '
target2=' two five six'

scramblematch(query, target1) returns TRUE and

scramblematch(query, target2) returns FALSE

The stringdist package might be what I need, but I don't know how to use it.

Update1

Use case for the function I am looking for: I have a dataset whose data was entered gradually over the years. The values of one text field (textfield) are not standardized, so people entered them differently. I now want to clean up this data by using a standardized set of values for textfield. All values that describe the same thing in different wording are to be replaced by the standardized value. For example (I am making this up):

In my standardized set of values (let's call it lookupfactors), I have lookupfactors=c('liver disease', 'and more'). In textfield I have the following rows:

liver cancer disease
some other thing
male, liver fibrosis disease
yet another thing
failure of liver, disease

In the final result, I want rows 1, 3, and 5 (because they contain 'liver' and 'disease') to be replaced by liver disease. Here I assume that the people who entered the data did not know the precise term but knew the keywords to include. Therefore, the words in the values of lookupfactors are a subset of the words in textfield.

Upvotes: 1

Views: 485

Answers (2)

talat

Reputation: 70336

One option to implement it is with %in% and strsplit:

scramblematch <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
scramblematch(query, target1)
#[1] TRUE
scramblematch(query, target2)
#[1] FALSE

A version that is vectorized over target, using stringi, could be:

library(stringi)
scramblematch <- function(query, target, sep = " ") {
  q <- stri_split_fixed(query, sep)[[1L]]
  sapply(stri_split_fixed(target, sep), function(x) {
    all(q %in% x)
  })
}

scramblematch(query, c(target1, target2))
#[1]  TRUE FALSE
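
Applied to the use case in the question, the same idea might look like the sketch below. The helper name contains_all_words and the result vector cleaned are made up for illustration. It assumes lower-case ASCII keywords, and it splits each row on runs of non-letters rather than on a single space, because splitting on " " alone would leave "liver," (with the comma) in row 5 and miss the match.

```r
# Hypothetical helper (not from the question): for each element of `target`,
# TRUE if it contains every word of `query` as a whole word.
contains_all_words <- function(query, target) {
  q <- strsplit(query, " ", fixed = TRUE)[[1L]]
  # Assumption: ASCII text. Lower-case, then split on runs of non-letters,
  # so punctuation such as "liver," does not hide the word "liver".
  words <- strsplit(tolower(target), "[^a-z]+")
  vapply(words, function(w) all(q %in% w), logical(1))
}

lookupfactors <- c("liver disease", "and more")
textfield <- c("liver cancer disease",
               "some other thing",
               "male, liver fibrosis disease",
               "yet another thing",
               "failure of liver, disease")

# Replace every row that contains all words of a lookup value by that value.
cleaned <- textfield
for (lf in lookupfactors) {
  cleaned[contains_all_words(lf, textfield)] <- lf
}
cleaned
```

Rows 1, 3, and 5 become "liver disease" and the other rows are left untouched. If several lookup values match the same row, the last one in lookupfactors wins in this sketch.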

Upvotes: 3

nicola

Reputation: 24520

You can try (the fixed=TRUE improvement is from @David's comment):

scramblematch<-function(query,target) {
   Reduce("&",lapply(strsplit(query," ")[[1]],grepl,target,fixed=TRUE))
}

A quick benchmark:

query='one five six'
target1='one six two six three four five '
target2=' two five six'
target<-rep(c(target1,target2),10000)
system.time(scramblematch(query,target))   
# user  system elapsed 
#0.008   0.000   0.008
scramblematchDD <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
system.time(vapply(target,scramblematchDD,query=query,TRUE))   
# user  system elapsed 
#0.657   0.000   0.658

The vapply is needed for the @docendodiscimus solution because that function is not vectorized over target.
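
One caveat about the grepl version: with fixed=TRUE it tests for substrings, not whole words, so "one" is also found inside "someone". If whole-word matching matters, a possible variant is to pad both the query words and the targets with spaces. This is only a sketch: the name scramblematch_words is made up here, and it assumes words are separated by spaces only (no punctuation).

```r
# Substring version from above, repeated so the example is self-contained.
scramblematch <- function(query, target) {
  Reduce("&", lapply(strsplit(query, " ")[[1]], grepl, target, fixed = TRUE))
}

# Hypothetical whole-word variant: surround each query word and each target
# with spaces, so a word only matches when it is delimited by spaces.
scramblematch_words <- function(query, target) {
  padded <- paste0(" ", target, " ")
  words <- paste0(" ", strsplit(query, " ")[[1]], " ")
  Reduce("&", lapply(words, grepl, padded, fixed = TRUE))
}

query <- 'one five six'
target1 <- 'one six two six three four five '
target2 <- ' two five six'

scramblematch("one six", "someone sixty")        # TRUE: substring hits
scramblematch_words("one six", "someone sixty")  # FALSE: no whole-word hit
scramblematch_words(query, c(target1, target2))  # TRUE FALSE
```

The padded version stays vectorized over target, since it still makes one grepl call per query word.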

Upvotes: 3
