Daya

Reputation: 25

regex for pattern in grep

I am trying to find gene symbols in some text. To do that, I want a pattern that matches gene symbols (they are usually three or more consecutive uppercase letters). I tried this, but it didn't work.

TW2 <- text_words [grep ("b\[[:upper:]]b\", text_words) ]

Upvotes: 1

Views: 54

Answers (1)

Wiktor Stribiżew

Reputation: 626690

You may use

text_words <- "GHJ GJKGKJ HHKKK J777 JJ8JJJJ"
TW2 <- unlist(regmatches(text_words, gregexpr("\\b[[:upper:]]{3,}\\b", text_words)))
TW2
## => [1] "GHJ"    "GJKGKJ" "HHKKK" 

See the R demo online

The pattern matches:

  • \\b - a word boundary
  • [[:upper:]]{3,} - 3 or more uppercase letters
  • \\b - a word boundary.
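
To see what each part of the pattern contributes, here is a small sketch that runs the same sample string through three variants. The helper name `extract_all` is made up for this illustration. Note that inside an R string literal every regex backslash has to be doubled, which is why the word boundary is written `\\b`.

```r
text_words <- "GHJ GJKGKJ HHKKK J777 JJ8JJJJ"

# Hypothetical helper: return every match of `pattern` in a single string
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x))[[1]]
}

# Full pattern: whole words made of 3 or more uppercase letters
with_bounds <- extract_all("\\b[[:upper:]]{3,}\\b", text_words)

# No word boundaries: uppercase runs inside longer tokens match too
no_bounds <- extract_all("[[:upper:]]{3,}", text_words)

# No quantifier: only one-letter uppercase words could match
single_only <- extract_all("\\b[[:upper:]]\\b", text_words)

print(with_bounds)
print(no_bounds)
print(single_only)
```

With this input, `with_bounds` holds the three expected words, `no_bounds` additionally picks up "JJJJ" from inside "JJ8JJJJ", and `single_only` is empty because the string contains no one-letter uppercase word.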

If you have a vector of strings and each string must match the pattern in full, use

text_words <- c("GHJ","GJKGKJ","HHKKK","J777","JJ8JJJJ")
TW2 <- grep("^[[:upper:]]{3,}$", text_words, value=TRUE)
TW2
## => [1] "GHJ"    "GJKGKJ" "HHKKK" 

Here, word boundaries are replaced with anchors, ^ for the start of the string and $ for the end of the string. See another R demo.
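
The difference between anchors and word boundaries shows up when a vector element contains more than one word. A minimal sketch, assuming a made-up vector `candidates` that is not from the question:

```r
# Hypothetical input: the second element has an uppercase word inside a longer string
candidates <- c("GHJ", "see GHJ here", "J777")

# Anchored: the whole element must consist of 3 or more uppercase letters
anchored <- grep("^[[:upper:]]{3,}$", candidates, value = TRUE)

# Word boundaries: the element only has to contain such a word somewhere
bounded <- grep("\\b[[:upper:]]{3,}\\b", candidates, value = TRUE)

print(anchored)
print(bounded)
```

Here `anchored` keeps only "GHJ", while `bounded` also keeps "see GHJ here". Which one you want depends on whether the vector holds single tokens or longer strings.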

Upvotes: 2
