prejay10

Reputation: 109

R: Why can't for loop or c() work out for grep function?

Thanks to the question "grep using a character vector with multiple patterns", I figured out my own problem as well. That question asked how to find multiple values with the grep function, and the solution was either of these:

grep("A1|A9|A6", text)

or

toMatch <- c("A1", "A9", "A6")
matches <- unique(grep(paste(toMatch, collapse = "|"), text))

So I used the second suggestion since I had MANY values to search for.

But I'm curious why c() or a for loop doesn't work in place of |. Before I searched Stack Overflow and found the recommendations above, I tried two alternatives, which I'll demonstrate below:

First, what I've written in R was something like this:

find.explore.l<-lapply(text.words.bl ,function(m) grep("^explor",m))

But then I had to grep for many words, so I tried this:

find.explore.l<-lapply(text.words.bl ,function(m) grep(c("A1","A2","A3"),m))

It didn't work, so I tried another approach (XXX is the list of words that I'm supposed to find in the text):

for (i in XXX){
  find.explore.l<-lapply(text.words.bl ,function(m) grep("XXX[i]", m))
    .......(more lines to append lines etc)
   }

and it seemed like R tried to match XXX[i] itself, not the words inside it. Why can't c() or a for loop make grep return the right results? Someone please let me know! I'm so curious :P

Upvotes: 3

Views: 1796

Answers (2)

Alex A.

Reputation: 5586

From the documentation for the pattern= argument in the grep() function:

Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.

This confirms that, as @nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.

You could, however, use a loop; you just have to modify your syntax.

toMatch <- c("A1", "A9", "A6")

# Loop over values to match
for (i in toMatch) {
    print(grep(i, text))  # print() is needed to show results inside a for loop
}

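Using "XXX[i]" as your pattern doesn't work because the quoted string itself is interpreted as a regular expression, in which [i] is a character class containing only the letter i. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes), where i is a numeric index. In a loop written as for (i in XXX), i is already the element, so you would pass i itself as the pattern.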
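To make the difference concrete, here is a minimal sketch. The vector `words` and the contents of `XXX` are made up for illustration and are not taken from the question.

```r
# Hypothetical data, for illustration only
XXX   <- c("A1", "A9")
words <- c("A1 first", "XXXi", "A9 last", "other")

# Quoted: the literal regex "XXX[i]" is used, where [i] is a character class
quoted <- grep("XXX[i]", words, value = TRUE)

# Unquoted, looping over indices: XXX[i] evaluates to "A1", then "A9"
by_index <- character(0)
for (i in seq_along(XXX)) {
  by_index <- c(by_index, grep(XXX[i], words, value = TRUE))
}

# Unquoted, looping over elements: i is already the pattern
by_element <- character(0)
for (i in XXX) {
  by_element <- c(by_element, grep(i, words, value = TRUE))
}
```

With this data, `quoted` contains only "XXXi", while both loops collect "A1 first" and "A9 last".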

You can also use lapply() for this, but in a slightly different way than you had done: you apply it to each regex in the vector of patterns, rather than to each text string.

lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)

However, the best approach would be, as you already have in your post, to use

matches <- unique(grep(paste(toMatch, collapse = "|"), text))
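As a rough comparison of the two approaches, the sketch below runs both on a small sample. The contents of `text` are assumed for illustration only, since the question does not show the real data.

```r
# Hypothetical sample data, for illustration only
text    <- c("A1 apple", "B2 banana", "A9 and A1", "A6 fig")
toMatch <- c("A1", "A9", "A6")

# One grep() call per pattern: a list of index vectors, one per pattern
per_pattern <- lapply(toMatch, function(rgx) grep(rgx, text))

# One grep() call with the patterns joined by "|"
combined <- grep(paste(toMatch, collapse = "|"), text)

# Flattening the per-pattern result gives the same set of indices
flattened <- sort(unique(unlist(per_pattern)))
```

Here `combined` and `flattened` are both 1, 3, 4. The per-pattern list is useful when you need to know which pattern matched where. The single combined call is shorter when you only need the matching elements.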

Upvotes: 1

Pierre L

Reputation: 28441

Consider that:

XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value = TRUE)
# character(0)
grep("XXX\\[i\\]", XXX, value = TRUE)
# [1] "XXX[i]"

What is R doing? It treats the first argument of grep as a regular expression, in which the brackets ([ and ]) are special characters that delimit a character class. I put in two backslashes before each bracket to tell R to treat them as literal brackets. Now imagine what would happen if I put that last expression into a for loop. It wouldn't do what I expected, because the quoted string never refers to the loop variable.
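If the goal really is to match the literal text, an alternative to escaping is the fixed = TRUE argument mentioned in the grep documentation. A minimal sketch using the same XXX vector as above:

```r
XXX <- c("a", "b", "XXX[i]")

# fixed = TRUE treats the pattern as a literal string, so no escaping is needed
literal <- grep("XXX[i]", XXX, value = TRUE, fixed = TRUE)

# Same result with the brackets escaped in a regular expression
escaped <- grep("XXX\\[i\\]", XXX, value = TRUE)
```

Both calls return "XXX[i]".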

If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.

# if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value = TRUE))
# [1] "a"
# [1] "b"

# if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
# [1] 1
# [1] 2

As the comments point out, grep(c("A1","A2","A3"), m) violates grep's requirement that the pattern be a single character string.

Upvotes: 0
