GegznaV

Reputation: 5580

Extract phrase that repeats in a string in R

The task is to extract a phrase from a string in R.

Consider a set of strings with a similar structure: each string begins with an opening bracket ( and ends with a closing bracket ), and in between it has several identical phrases separated by commas, e.g. "(matrix,matrix,matrix)". It is not known in advance how many times the phrase repeats: it can be 2 ("(matrix,matrix)"), 3, 4 ("(matrix,matrix,matrix,matrix)"), etc., usually up to 6. If there is only one phrase, it is not in (additional) brackets (e.g., "matrix" or "(matrix)").

I managed to extract one phrase by using:

NAME <- "(matrix,matrix,matrix,matrix)"
gsub("(\\()(.*,){1,}(.*)\\)", "\\3", NAME, perl = TRUE)

But there is a more advanced question: how can I check whether the phrase repeats, extract it only if it does, and otherwise leave the string as-is? E.g., how to extract
"matrix" from "(matrix,matrix,matrix,matrix)",
"A B" from "(A B,A B)",
"(A,B,C)" from "((A,B,C),(A,B,C),(A,B,C),(A,B,C))", and
"A,B,C" from "(A,B,C,A,B,C,A,B,C,A,B,C)",
but "(A,B,C)" must be left intact, as it does not repeat.

Upvotes: 1

Views: 75

Answers (1)

Wiktor Stribiżew

Reputation: 626738

You can use

\((.+?)(?:,\1)+\)

See the regex demo

Pattern explanation:

  • \( - opening round bracket
  • (.+?) - Group 1 matching 1 or more characters other than a newline, as few as possible (lazy quantifier)
  • (?:,\1)+ - 1 or more sequences of , followed by the value captured into Group 1
  • \) - a closing round bracket.
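
The question also asks how to check whether a phrase repeats at all. As a rough sketch, the same pattern can be passed to grepl(). The variable names and the choice of perl = TRUE (PCRE) are assumptions made for illustration, not part of the original answer.

```r
# The pattern described above, written as an R string (backslashes doubled)
pattern <- "\\((.+?)(?:,\\1)+\\)"

# Hypothetical example inputs
x <- c("(matrix,matrix,matrix)", "(m,d,s)", "matrix")

# TRUE where a bracketed, comma-separated repeated phrase is found
repeats <- grepl(pattern, x, perl = TRUE)
print(repeats)
```

Here only the first element should be flagged, since "(m,d,s)" contains no repeated phrase and "matrix" has no brackets.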

R demo:

> s = "(matrix,matrix,matrix)"
> gsub("\\((.+?)(?:,\\1)+\\)", "\\1", s)
[1] "matrix"
> s = "(m,d,s)"
> gsub("\\((.+?)(?:,\\1)+\\)", "\\1", s)
[1] "(m,d,s)"
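
Since gsub() is vectorised, the examples from the question can be processed in one call. The following is a minimal sketch, assuming perl = TRUE (PCRE) is acceptable; the helper name extract_repeated is hypothetical.

```r
# Hypothetical helper: return the repeated phrase, or the input unchanged
extract_repeated <- function(x) {
  gsub("\\((.+?)(?:,\\1)+\\)", "\\1", x, perl = TRUE)
}

examples <- c(
  "(matrix,matrix,matrix,matrix)",      # expected: "matrix"
  "(A B,A B)",                          # expected: "A B"
  "((A,B,C),(A,B,C),(A,B,C),(A,B,C))",  # expected: "(A,B,C)"
  "(A,B,C,A,B,C,A,B,C,A,B,C)",          # expected: "A,B,C"
  "(A,B,C)"                             # expected: unchanged, no repetition
)

result <- extract_repeated(examples)
print(result)
```

Note that the pattern is not anchored, so gsub() will also collapse a matching substring inside a longer string. If only whole strings should be replaced, wrapping the pattern in ^ and $ is one option.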

Upvotes: 3
