Joeytje50
Joeytje50

Reputation: 19122

Match contents between parentheses

My question is actually really simple: How do I match the contents of the outer parentheses in a string with nested parentheses, like the following:

Lorem (ipsum (dolor) sit) amet

I have been struggling with this problem in quite some occasions, and it seems to me there should be a really simple solution, since quite a lot of languages use tree-based structures (HTML/XML uses <tag><tag></tag></tag>, JSON uses [[]] or {"a":{}}).

The major problem here is that using any regular expressions, the match would end at the first closing parenthesis, instead of the correct one. I have also thought about using String.split('('), or even creating an array with indices of opening and closing parentheses, but those in the end got me nowhere either (perhaps I overlooked a way these tactics can be used though).

My best attempt so far is just going through every single character, testing if that character is a parenthesis, and continuing until there are 0 opened parentheses left:

var str = 'Lorem (ipsum (dolor) sit) amet';
var opened = 1;
var start = str.indexOf('(')+1
for (var i=start; i<str.length && opened; i++) {
    if (str.charAt(i) == '(') opened++;
    else if (str.charAt(i) == ')') opened--;
}
console.log(str.substring(start, i-1));

I am really wondering though, is there a simpler way to do this at all, perhaps using built-in functions I didn't think of? It seems to me there should be, but I haven't been able to find any or come up with a simple solution myself.

Upvotes: 0

Views: 168

Answers (3)

Joeytje50
Joeytje50

Reputation: 19122

Going by the current answers, and the lack of improvement in simplicity of the code, other than having to download a quite large library I'm going to assume that there isn't really a much simpler way to do this. Here's the code in the form of an easy to use function:

function getMatchedBracket(str, pos) {
    pos = pos || 0;
    var opened = 1;
    var start = i = str.indexOf('(', pos)+1
    for (; i<str.length; i++) {
        if (str.charAt(i) === '(') ++opened;
        else if (str.charAt(i) === ')' && !--opened) break;
    }
    return str.substring(start, i);
}
//and to test it
console.log(getMatchedBracket('Lorem (ipsum (dolor) sit) amet')); //ipsum (dolor) sit
//If you find this doesn't work correctly for anything, please let me know.

The forloop here doesn't use switches because those are slow in this case. Also, the test to see if the loop should be terminated is not in the condition anymore.

Upvotes: 0

user1196549
user1196549

Reputation:

If you know that there is exactly one pair of outer parenthesis, there can be a faster way: scan the string from the left until you meet a left parenthesis (strchr); scan from the right until you meet a right parenthesis (strrchr).

Three benefits: 1) during a scan you only test against a single character value; 2) you spare doing any comparisons inside the parenthesis; 3) there are builtins for this purpose.

If there can be several outer pairs, you cannot avoid scanning the whole string and bookkeeping the nesting level.

Upvotes: 0

waTeim
waTeim

Reputation: 9244

The underlying problem is the regular expressions are incapable of describing a language that requires matched parentheses because a language element like this makes the language context free, and this is an often cited example of the increased power of push down automaton (PDA) versus Finite State Automata (FA).

At its most basic, the production that is required for what you want is

N = aNa

Where a terminal appears on both sides of a non terminal. This is the signature example of a context free grammar which requires a PDA to recognize it. As opposed to a more simplistic grammar.

N = aN

Which signifies a regular expression and only requires an FA to recognize it.

You need a more powerful class of recognition then a regular expression can muster. Luckily there is a nice parser generator for javascript that I can recommend pegjs

Ok completed, did I say "a little while?" I'm familiar with nodejs so this is all written assuming that. But peg supports operations in the browser too since it's all javascript.

You know after all this time, I'm not quite sure this is what you were wanting, but I've made it so that anything that is wrapped with at least 1 pair of parenthese is preserved otherwise it is elided. I went ahead and completed this even though it because a big-ish project for illustration purposes. But if that's not what you were wanting they have a look at the parse tree that gets generated, it's all JSON....

I've stored the source in a github GIST, feel free to download, or whatever

The pegjs source

The driver javascript

The generated parser source

Sample 1

Sample 2

cat input1
Lorem (ipsum (dolor) sit) amet
node match.js < input1
ipsum (dolor) sit 

cat input2
Lorem (ipsum (dolor) sit) amet (a) a b s  (asd f d a (a d d d a) asda )
node match.js < input2
ipsum (dolor) sit  a    asd f d a (a d d d a) asda

Upvotes: 2

Related Questions