Phil_T
Phil_T

Reputation: 1002

R: parse nested parentheses

I would like to parse nested parentheses using R. No, this is not JASON. I have seen examples using perl, php, and python, but I am having trouble getting anything to work in R. Here is an example of some data:

(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)

I would like to split this string based on the three parent parentheses into three separate strings:

(a(a(a)(aa(a)a)a)a)

((b(b)b)b)

(((cc)c)c)

One of the challenges I am facing is the lack of a consistent structure in terms of total pairs of child parentheses within the parent parentheses, and the number of consecutive open or closed parentheses. Notice the consecutive open parentheses in the data with Bs and with Cs. This has made attempts to use regex very difficult. Also, the data within a given parent parentheses will have many common characters to other parent parentheses, so looking for all "a"s or "b"s is not possible - I fabricated this data to help people see the three parent parentheses better.

Basically I am looking for a function that identifies parent parentheses. In other words, a function that can find parentheses that are not contained with parentheses, and return all instances of this for a given string.

Any ideas? I appreciate the help.

Upvotes: 4

Views: 1538

Answers (2)

Sandipan Dey
Sandipan Dey

Reputation: 23101

Assuming that there are matching paranthesis, you can try the following (this is like a PDA, pushdown automata, if you are familiar with theory of computation):

str <- '(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)'
indices <- c(0, which(cumsum(sapply(unlist(strsplit(str, split='')), 
                function(x) ifelse(x == '(', 1, ifelse(x==')', -1, 0))))==0))
sapply(1:(length(indices)-1), function(i) substring(str, indices[i]+1, indices[i+1]))
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)"          "(((cc)c)c)"         

Upvotes: 2

akuiper
akuiper

Reputation: 214927

Here is one directly adapted from Regex Recursion with \\((?>[^()]|(?R))*\\):

s = "(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)"
matched <- gregexpr("\\((?>[^()]|(?R))*\\)", s, perl = T)
substring(s, matched[[1]], matched[[1]] + attr(matched[[1]], "match.length") - 1)
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)"          "(((cc)c)c)"   

Upvotes: 9

Related Questions