Reputation: 13
I have many strings that all have the following format:
mystrings <- c(
"(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
"(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
"(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
"(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)
I need to capture the strings that are inside parentheses at both the start and the end of each element of mystrings.
The variable start should store the starting string for each element, at the same index. The result should look like this:
start[1]
ABFUHIASH
start[2]
SECONDSTR
start[3]
JOWERIC
start[4]
CAPTURETHIS
Similarly, the ending for each string in mystrings should be saved into end:
end[1]
ENDING
end[2]
RANDOMENDING
end[3]
GETTHIS
end[4]
IJFAI
Parentheses themselves should NOT be captured.
Is there a way/function to do this quickly in R?
I have tried stringr::word and stringi::stri_extract, but I am getting very strange results.
Upvotes: 0
Views: 54
Reputation: 206197
We can use the stringr library for this. For example:
library(stringr)
mm <- str_match(mystrings, "^\\(([^)]+)\\).*\\(([^)]+)\\)$")
mm
The pattern captures the text between the parentheses at the beginning and at the end of the string in two capture groups, so they can be easily extracted.
str_match returns a character matrix whose first column is the full match; you seem to want just the 2nd and 3rd columns:
mm[,2:3]
[,1] [,2]
[1,] "ABFUHIASH" "ENDING"
[2,] "SECONDSTR" "RANDOMENDING"
[3,] "JOWERIC" "GETTHIS"
[4,] "CAPTURETHIS" "IJFAI"
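To get the two separate vectors the question asks for, the columns of the match matrix can be assigned directly. This is a minimal sketch, assuming stringr is installed; the names start_str and end_str are chosen here for illustration (they are not from the question) to avoid masking base R's start() and end() functions.

```r
library(stringr)

mystrings <- c(
  "(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
  "(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
  "(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
  "(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)

# Column 1 of the result is the full match; columns 2 and 3 hold
# the first and second capture groups.
mm <- str_match(mystrings, "^\\(([^)]+)\\).*\\(([^)]+)\\)$")

start_str <- mm[, 2]  # text inside the leading parentheses
end_str   <- mm[, 3]  # text inside the trailing parentheses

start_str[3]
end_str[3]
```

Strings that do not fit the pattern produce NA in every column, so it may be worth checking anyNA(mm) before relying on the result.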
Upvotes: 2
Reputation: 992
Something like this might work for you:
> regmatches(mystrings,gregexpr("\\(.+?\\)",mystrings))
[[1]]
[1] "(ABFUHIASH)" "(ENDING)"
[[2]]
[1] "(SECONDSTR)" "(RANDOMENDING)"
[[3]]
[1] "(JOWERIC)" "(GETTHIS)"
[[4]]
[1] "(CAPTURETHIS)" "(IJFAI)"
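E.g., to extract the endings (parentheses still included), you could store the result and take the last element of each list entry:
x <- regmatches(mystrings, gregexpr("\\(.+?\\)", mystrings))
lapply(x, tail, 1)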
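Since the question asks for the text without the parentheses, the same base R approach could be extended roughly as follows. This is only a sketch: strip_parens, start_str and end_str are hypothetical names introduced here, and it assumes every string contains at least one parenthesised group (otherwise vapply will stop with an error).

```r
mystrings <- c(
  "(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
  "(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
  "(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
  "(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)

# Find every "(...)" group in each string, matching non-greedily.
x <- regmatches(mystrings, gregexpr("\\(.+?\\)", mystrings))

# Hypothetical helper: remove one leading "(" and one trailing ")".
strip_parens <- function(s) gsub("^\\(|\\)$", "", s)

# First and last group of each string, as plain character vectors.
start_str <- strip_parens(vapply(x, head, character(1), 1))
end_str   <- strip_parens(vapply(x, tail, character(1), 1))

start_str
end_str
```

Using vapply instead of lapply returns character vectors rather than lists, which matches the start[1], end[1] indexing shown in the question.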
Upvotes: 0