Reputation: 5580
The task is to extract a phrase from a string using R.
Consider a set of strings with a similar structure: each string begins with an opening bracket ( and ends with a closing bracket ), and in between it has several identical phrases separated by commas, e.g. "(matrix,matrix,matrix)". It is not known in advance how many times the phrase repeats: it can be 2 ("(matrix,matrix)"), 3, 4 ("(matrix,matrix,matrix,matrix)"), etc., usually up to 6. If there is only one phrase, it is not wrapped in additional brackets (e.g. "matrix" or "(matrix)").
I managed to extract one phrase by using:
NAME <- "(matrix,matrix,matrix,matrix)"
gsub("(\\()(.*,){1,}(.*)\\)", "\\3", NAME, perl = TRUE)
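A quick check, offered only as a sketch, suggests that this pattern returns whatever follows the last comma, whether or not the phrases actually repeat. The variable names `pattern` and `last_field` are made up for the illustration.

```r
# The pattern from the attempt above: group 3 is the text after the last comma.
pattern <- "(\\()(.*,){1,}(.*)\\)"

# One string with a repeating phrase and one without repetition.
last_field <- gsub(pattern, "\\3",
                   c("(matrix,matrix,matrix,matrix)", "(A,B,C)"),
                   perl = TRUE)

# The second element comes back as "C", although "(A,B,C)" has no repeating phrase.
print(last_field)
```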
But there is a more advanced question: how can I check whether the phrase repeats, and extract it only if it does, otherwise leaving the string as-is? E.g. how to extract
"matrix" from "(matrix,matrix,matrix,matrix)",
"A B" from "(A B,A B)",
"(A,B,C)" from "((A,B,C),(A,B,C),(A,B,C),(A,B,C))", and
"A,B,C" from "(A,B,C,A,B,C,A,B,C,A,B,C)",
but "(A,B,C)" must be left intact, as it does not repeat.
Upvotes: 1
Views: 75
Reputation: 626738
You can use
\((.+?)(?:,\1)+\)
See the regex demo
Pattern explanation:
\( - an opening round bracket
(.+?) - Group 1, matching 1 or more characters other than a newline, as few as possible
(?:,\1)+ - 1 or more sequences of , followed by the value captured into Group 1
\) - a closing round bracket.
R demo:
> s = "(matrix,matrix,matrix)"
> gsub("\\((.+?)(?:,\\1)+\\)", "\\1", s)
[1] "matrix"
> s = "(m,d,s)"
> gsub("\\((.+?)(?:,\\1)+\\)", "\\1", s)
[1] "(m,d,s)"
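To check the pattern against every example from the question at once, something like the following sketch could be used. The helper name `extract_repeated` is hypothetical, and the `^` and `$` anchors and `perl = TRUE` are additions to the pattern above, assumed here so that only a whole string consisting of one repeated phrase is rewritten.

```r
# Hypothetical helper: returns the repeated phrase, or the input unchanged
# when the string is not "(phrase,phrase,...)" with at least two copies.
extract_repeated <- function(x) {
  # ^ and $ are an assumption added here; they keep the match from
  # rewriting only an inner part of a longer string.
  sub("^\\((.+?)(?:,\\1)+\\)$", "\\1", x, perl = TRUE)
}

inputs <- c(
  "(matrix,matrix,matrix,matrix)",
  "(A B,A B)",
  "((A,B,C),(A,B,C),(A,B,C),(A,B,C))",
  "(A,B,C,A,B,C,A,B,C,A,B,C)",
  "(A,B,C)",   # no repetition: left intact
  "matrix"     # no brackets: left intact
)

result <- extract_repeated(inputs)
print(result)
```

Since `sub` is vectorized over `x`, the same call works on a whole character column. Without the anchors, an input such as "((A,A),B)" would have its inner "(A,A)" collapsed, which may or may not be what is wanted.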
Upvotes: 3