Reputation: 25
I am trying to find gene symbols in some text. For that purpose, I am trying to write a pattern that matches gene symbols (they are usually three or more uppercase letters in a row). I tried this, but it didn't work.
TW2 <- text_words [grep ("b\[[:upper:]]b\", text_words) ]
Upvotes: 1
Views: 54
Reputation: 626690
You may use
text_words <- "GHJ GJKGKJ HHKKK J777 JJ8JJJJ"
TW2 <- unlist(regmatches(text_words, gregexpr("\\b[[:upper:]]{3,}\\b", text_words)))
TW2
## => [1] "GHJ" "GJKGKJ" "HHKKK"
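If the text is split into several strings (for example, one per sentence), the same call works on the whole vector. The following is only a sketch: the `sentences` vector and the variable names are made up for illustration and are not from the question.

```r
# Hypothetical input: one element per sentence (not from the original post)
sentences <- c("Mutations in ABC and DEFG were found.",
               "No symbols here.",
               "XYZ binds to Ab and QRSTU.")

pattern <- "\\b[[:upper:]]{3,}\\b"

# regmatches() returns a list with one character vector per input element
per_sentence <- regmatches(sentences, gregexpr(pattern, sentences))

# Number of candidate symbols found in each sentence
hits_per_sentence <- lengths(per_sentence)

# Flatten to a single vector of candidate symbols
all_hits <- unlist(per_sentence)
all_hits
```

Keeping the list form (`per_sentence`) is useful when you need to know which sentence each match came from; `unlist()` discards that information.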
The pattern matches:
- \\b - a word boundary
- [[:upper:]]{3,} - 3 or more uppercase letters
- \\b - a word boundary.

If you have a vector with the strings you need to test against the pattern in full, use
text_words <- c("GHJ","GJKGKJ","HHKKK","J777","JJ8JJJJ")
TW2 <- grep("^[[:upper:]]{3,}$", text_words, value=TRUE)
TW2
## => [1] "GHJ" "GJKGKJ" "HHKKK"
Here, the word boundaries are replaced with anchors: ^ for the start of the string and $ for the end of the string.
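To see the difference between the two patterns side by side, here is a small sketch using grepl() on a made-up vector (the vector `x` and the result names are assumptions for illustration only):

```r
# Hypothetical vector mixing a bare symbol with longer strings
x <- c("GHJ", "see GHJ here", "J777", "ab")

# Word boundaries: TRUE when a run of 3+ uppercase letters
# appears as a whole word anywhere in the string
has_symbol <- grepl("\\b[[:upper:]]{3,}\\b", x)

# Anchors: TRUE only when the entire string consists of
# 3+ uppercase letters
is_symbol <- grepl("^[[:upper:]]{3,}$", x)

data.frame(x, has_symbol, is_symbol)
```

The second element contains a symbol but is not itself a symbol, so only the word-boundary pattern flags it.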
Upvotes: 2