please help

R: Finding multiple string matches in a vector of strings

I have the following list of file names:

files.list <- c("Fasted DWeib NoCmaxW.xlsx", "Fed DWeib NoCmaxW.xlsx", "Fasted SWeib NoCmaxW.xlsx", "Fed SWeib NoCmaxW.xlsx", "Fasted DWeib Cmax10.xlsx", "Fed DWeib Cmax10.xlsx", "Fasted SWeib Cmax10.xlsx", "Fed SWeib Cmax10.xlsx")

I want to identify which files have the following sub-strings:

toMatch <- c("Fasted", "DWeib NoCmaxW")

The examples I have found often quote the following usage:

grep(paste(toMatch, collapse = "|"), files.list, value=TRUE)

However, this returns five possibilities, because it matches any file containing either sub-string:

[1] "Fasted DWeib NoCmaxW.xlsx" "Fed DWeib NoCmaxW.xlsx"    "Fasted SWeib NoCmaxW.xlsx"
[4] "Fasted DWeib Cmax10.xlsx"  "Fasted SWeib Cmax10.xlsx" 

I want the filename which contains both elements of toMatch (i.e. "Fasted" and "DWeib NoCmaxW"). There is only one file which satisfies that requirement (files.list[1]). I assumed the "|" in the paste command might be a logical OR, and so I tried "&", but that didn't address my problem.

Can someone please help?

Thank you.

Answers (1)

akrun

We can combine two grepl calls with &, which is TRUE only where both sub-strings are found

i1 <- grepl(toMatch[1], files.list) & grepl(toMatch[2], files.list)

If 'toMatch' has more than two elements, loop through them with lapply and Reduce the resulting list to a single logical vector with &

i1 <- Reduce(`&`, lapply(toMatch, grepl, x = files.list))
files.list[i1]
#[1] "Fasted DWeib NoCmaxW.xlsx"
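The lapply/Reduce step can be wrapped in a small helper, sketched here. The name match_all is hypothetical (it is not a base R function), and passing fixed = TRUE is an assumption that the elements of 'toMatch' are meant as literal sub-strings rather than regular expressions.

```r
files.list <- c("Fasted DWeib NoCmaxW.xlsx", "Fed DWeib NoCmaxW.xlsx",
                "Fasted SWeib NoCmaxW.xlsx", "Fed SWeib NoCmaxW.xlsx",
                "Fasted DWeib Cmax10.xlsx", "Fed DWeib Cmax10.xlsx",
                "Fasted SWeib Cmax10.xlsx", "Fed SWeib Cmax10.xlsx")
toMatch <- c("Fasted", "DWeib NoCmaxW")

# match_all: hypothetical helper, returns the elements of x that
# contain every element of patterns; extra arguments go to grepl
match_all <- function(patterns, x, ...) {
  # one logical vector per pattern
  hits <- lapply(patterns, grepl, x = x, ...)
  # element-wise AND across all of them
  x[Reduce(`&`, hits)]
}

# fixed = TRUE treats each pattern as a literal string (an assumption here)
match_all(toMatch, files.list, fixed = TRUE)
#[1] "Fasted DWeib NoCmaxW.xlsx"
```

Because each pattern is tested separately, the order of the sub-strings within a file name does not matter.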

It is also possible to collapse the elements with .* so that the pattern matches the first element of 'toMatch', followed by a word boundary (\\b), then any characters (.*), and another word boundary (\\b) before the second element of 'toMatch'. This works for the example. It may be better to add a word boundary at the start and end of the pattern as well, although that is not needed here.

pat1 <- paste(toMatch, collapse= "\\b.*\\b")
grep(pat1, files.list, value = TRUE)
#[1] "Fasted DWeib NoCmaxW.xlsx"

However, this only finds matches where the sub-strings occur in the same order as in 'toMatch'. If the sub-strings may appear in reverse order and those should match as well, create a second pattern in the reverse order and join the two with |

pat2 <- paste(rev(toMatch), collapse="\\b.*\\b")
pat <- paste(pat1, pat2, sep="|")
grep(pat, files.list, value = TRUE) 
#[1] "Fasted DWeib NoCmaxW.xlsx"
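With more than two elements, listing every ordering becomes impractical. One alternative, sketched here as a suggestion rather than part of the approach above, is to build one lookahead per element and use perl = TRUE. The name pat_all is illustrative, and the sketch assumes the elements of 'toMatch' contain no regex metacharacters.

```r
files.list <- c("Fasted DWeib NoCmaxW.xlsx", "Fed DWeib NoCmaxW.xlsx",
                "Fasted SWeib NoCmaxW.xlsx", "Fed SWeib NoCmaxW.xlsx",
                "Fasted DWeib Cmax10.xlsx", "Fed DWeib Cmax10.xlsx",
                "Fasted SWeib Cmax10.xlsx", "Fed SWeib Cmax10.xlsx")
toMatch <- c("Fasted", "DWeib NoCmaxW")

# Build "^(?=.*Fasted)(?=.*DWeib NoCmaxW)": each lookahead requires its
# sub-string somewhere in the string, in any order
pat_all <- paste0("^", paste0("(?=.*", toMatch, ")", collapse = ""))

# Lookaheads are only available with the PCRE engine, hence perl = TRUE
grep(pat_all, files.list, value = TRUE, perl = TRUE)
#[1] "Fasted DWeib NoCmaxW.xlsx"
```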
