DaveF
DaveF

Reputation: 6869

Regular expression to match balanced parentheses

I need a regular expression to select all the text between two outer brackets.

Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^

Result:
(text here(possible text)text(possible text(more text)))

Upvotes: 449

Views: 528868

Answers (23)

Fedor Tykmakov
Fedor Tykmakov

Reputation: 39

As Uniform answer:

This would be much more efficient and portable to most languages if you use the regex to do that it building for - fast pattern matching. (not logic by it) some languages has nested regex features, but even in it this be less functionality, then in Stack algorithm working over the regex base.

Even if you realy need ony top level brace select you better check stack solution (cause of portability regex code, and efficient)

Here is the example by JS:

let[struct]='a{aa},b{},c{ca,cb},d{da{daa,dab},db}'
.matchAll(/([^,\{\}]+)(\{?)(\}?)(,?)/g).reduce(($,[_,key,brO,brC,com])=>{
    let val=brO?{}:$.n++; // define nested or primitive value
    // primitive can set by more complex regex logic above just 'key' grp
    // for example primitive set just natural index (starts from zero)
    $.at(-1)[key]=val;      // set key=val to the current level of object
    if(brO)$.push(val);     // push stack level if need
    if(brC)$.pop();         // pop stack level if need
    return $;               // return stack
},Object.assign([{}],{n:0}))
console.log("struct:",struct)

you can reduce this version

  • for only selecting top level baraket chank (but this step most likely will also only the part of you job, already or soon)

or you can code above:

  • to chank the key part more complex (e.g. split it to 'key:type' to add some primitive Type aliases)
  • to create some custom JSON alternative format
  • to test in parallel some other structure by string pattern

Upvotes: 1

bobble bubble
bobble bubble

Reputation: 18565

I want to add this answer for quickreference. Feel free to update.


.NET Regex using balancing groups:

\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)

Where c is used as the depth counter.

Demo at Regexstorm.com


PCRE using a recursive pattern:

\((?:[^)(]+|(?R))*+\)

Demo at regex101; Or without alternation:

\((?:[^)(]*(?R)?)*+\)

Demo at regex101; Or unrolled for performance:

\([^)(]*+(?:(?R)[^)(]*)*+\)

Demo at regex101; The pattern is pasted at (?R) which represents (?0).

Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)


Ruby using subexpression calls:

With Ruby 2.0 \g<0> can be used to call full pattern.

\((?>[^)(]+|\g<0>)*\)

Demo at Rubular; Ruby 1.9 only supports capturing group recursion:

(\((?>[^)(]+|\g<1>)*\))

Demo at Rubular  (atomic grouping since Ruby 1.9.3)


JavaScript  API :: XRegExp.matchRecursive or regex-recursion

XRegExp.matchRecursive(str, '\\(', '\\)', 'g');

Java: An interesting idea using forward references by @jaytea.


Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)

To prevent runaway if unbalanced, with * on innermost [)(] only.

\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)

Demo at regex101; Or unrolled for better performance (preferred).

\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)

Demo at regex101; Deeper nesting needs to be added as required.

// JS-Snippet to generate pattern
function generatePattern()
{
  // Set max depth & pattern type
  let d = document.getElementById("maxDepth").value;
  let t = document.getElementById("patternType").value;
  
  // Pattern variants: 0=default, 1=unrolled (more efficient)
  let p = [['\\((?:[^)(]|',')*\\)'], ['\\([^)(]*(?:','[^)(]*)*\\)']];
  
  // Generate and display the pattern
  console.log(p[t][0].repeat(d) + '\\([^)(]*\\)' + p[t][1].repeat(d));
} generatePattern();
Max depth = <input type="text" id="maxDepth" size="1" value="3"> 
<select id="patternType" onchange="generatePattern()">
  <option value="0">default pattern</option>
  <option value="1" selected>unrolled pattern</option>
</select>
<input type="submit" onclick="generatePattern()" value="generate!">


Reference - What does this regex mean?

Upvotes: 291

Alan G&#243;mez
Alan G&#243;mez

Reputation: 378

A Vim regex that achieve that is:

:%s/\(.\{-}\)\((\(.\{-}(\)\+.\{-}\().\{-}\)\+)\)\(.*\)/\2/g 

Upvotes: 0

Manish
Manish

Reputation: 3663

I was also stuck in this situation when dealing with nested patterns and regular-expressions is the right tool to solve such problems.

/(\((?>[^()]+|(?1))*\))/

Upvotes: 14

Frank
Frank

Reputation: 66254

Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.

But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.

Upvotes: 186

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627609

Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.

Lua

Use %b() (%b{} / %b[] for curly braces / square brackets):

  • for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)

Raku (former Perl6):

Non-overlapping multiple balanced parentheses matches:

my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)

Overlapping multiple balanced parentheses matches:

say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)

See demo.

Python re non-regex solution

See poke's answer for How to get an expression between balanced parentheses.

Java customizable non-regex solution

Here is a customizable solution allowing single character literal delimiters in Java:

public static List<String> getBalancedSubstrings(String s, Character markStart, 
                                 Character markEnd, Boolean includeMarkers) 

{
        List<String> subTreeList = new ArrayList<String>();
        int level = 0;
        int lastOpenDelimiter = -1;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == markStart) {
                level++;
                if (level == 1) {
                    lastOpenDelimiter = (includeMarkers ? i : i + 1);
                }
            }
            else if (c == markEnd) {
                if (level == 1) {
                    subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
                }
                if (level > 0) level--;
            }
        }
        return subTreeList;
    }
}

Sample usage:

String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]

Upvotes: 6

TOPKAT
TOPKAT

Reputation: 8698

This do not fully address the OP question but I though it may be useful to some coming here to search for nested structure regexp:

Parse parmeters from function string (with nested structures) in javascript

Match structures like:
Parse parmeters from function string

  • matches brackets, square brackets, parentheses, single and double quotes

Here you can see generated regexp in action

/**
 * get param content of function string.
 * only params string should be provided without parentheses
 * WORK even if some/all params are not set
 * @return [param1, param2, param3]
 */
exports.getParamsSAFE = (str, nbParams = 3) => {
    const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
    const params = [];
    while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
        str = str.replace(nextParamReg, (full, p1) => {
            params.push(p1);
            return '';
        });
    }
    return params;
};

Upvotes: 0

Kishor Patil
Kishor Patil

Reputation: 950

This might help to match balanced parenthesis.

\s*\w+[(][^+]*[)]\s*

Upvotes: -2

Daniel
Daniel

Reputation: 2308

I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:

def extract_code(data):
    """ returns an array of code snippets from a string (data)"""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i,v in enumerate(data):
        if v =='{':
            count_open+=1
            if not start_pos:
                start_pos= i
        if v=='}':
            count_close +=1
            if count_open == count_close and not end_pos:
                end_pos = i+1
        if start_pos and end_pos:
            code_snippets.append((start_pos,end_pos))
            start_pos = None
            end_pos = None

    return code_snippets

I used this to extract code snippets from a text file.

Upvotes: 0

Prakhar Agrawal
Prakhar Agrawal

Reputation: 1032

While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.

Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.

Read more @ here

Upvotes: 0

crapthings
crapthings

Reputation: 2485

because js regex doesn't support recursive match, i can't make balanced parentheses matching work.

so this is a simple javascript for loop version that make "method(arg)" string into array

push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
  let ops = []
  let method, arg
  let isMethod = true
  let open = []

  for (const char of str) {
    // skip whitespace
    if (char === ' ') continue

    // append method or arg string
    if (char !== '(' && char !== ')') {
      if (isMethod) {
        (method ? (method += char) : (method = char))
      } else {
        (arg ? (arg += char) : (arg = char))
      }
    }

    if (char === '(') {
      // nested parenthesis should be a part of arg
      if (!isMethod) arg += char
      isMethod = false
      open.push(char)
    } else if (char === ')') {
      open.pop()
      // check end of arg
      if (open.length < 1) {
        isMethod = true
        ops.push({ method, arg })
        method = arg = undefined
      } else {
        arg += char
      }
    }
  }

  return ops
}

// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)

console.log(test)

the result is like

[ { method: 'push', arg: 'number' },
  { method: 'map', arg: 'test(a(a()))' },
  { method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
  { method: 'filter',
    arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
  { method: 'pickBy', arg: '_id,type' },
  { method: 'map', arg: 'test()' },
  { method: 'as', arg: 'groups' } ]

Upvotes: 0

Somnath Musib
Somnath Musib

Reputation: 3724

This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.


Regular expressions can not do this.

Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.

FSA

In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:

      0     1     1     0
-> S1 -> S2 -> S2 -> S2 ->S1

In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.

In the above problem, we need to know the no of opening parenthesis; this means it has to be stored at some place. But since FSAs can not do that, a regular expression can not be written.

However, an algorithm can be written to do this task. Algorithms are generally falls under Pushdown Automata (PDA). PDA is one level above of FSA. PDA has an additional stack to store some additional information. PDAs can be used to solve the above problem, because we can 'push' the opening parenthesis in the stack and 'pop' them once we encounter a closing parenthesis. If at the end, stack is empty, then opening parenthesis and closing parenthesis matches. Otherwise not.

Upvotes: 30

Shell Scott
Shell Scott

Reputation: 1909

You need the first and last parentheses. Use something like this:

str.indexOf('('); - it will give you first occurrence

str.lastIndexOf(')'); - last one

So you need a string between,

String searchedString = str.substring(str1.indexOf('('),str1.lastIndexOf(')');

Upvotes: 1

Chad
Chad

Reputation: 9859

I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing

balanced.matches({
    source: source,
    open: '(',
    close: ')'
});

You can even do replacements:

balanced.replacements({
    source: source,
    open: '(',
    close: ')',
    replace: function (source, head, tail) {
        return head + source + tail;
    }
});

Here's a more complex and interactive example JSFiddle.

Upvotes: 5

Hamid Mir
Hamid Mir

Reputation: 373

This one also worked

re.findall(r'\(.+\)', s)

Upvotes: -4

Gene Olson
Gene Olson

Reputation: 930

"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.

This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns.  This is where the re package greatly
assists in parsing. 
"""

import re


# The pattern below recognises a sequence consisting of:
#    1. Any characters not in the set of open/close strings.
#    2. One of the open/close strings.
#    3. The remainder of the string.
# 
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included.  However quotes are not ignored inside
# quotes.  More logic is needed for that....


pat = re.compile("""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
                           \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)

# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.

matching = { "(" : ")",
             "[" : "]",
             "{" : "}",
             "<" : ">",
             '"' : '"',
             "'" : "'",
             "BEGIN" : "END" }

# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.

def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)

        if m.group(1) != "":
            lst.append(m.group(1))

        if m.group(2) == term:
            return lst, m.group(3)

        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))

# Unit test.

if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print "\ninput", s
        try:
            lst, s = matchnested(s)
            print "output", lst
        except ValueError as e:
            print str(e)
    print "done"

Upvotes: 2

Marco
Marco

Reputation: 338

This is the definitive regex:

\(
(?<arguments> 
(  
  ([^\(\)']*) |  
  (\([^\(\)']*\)) |
  '(.*?)'

)*
)
\)

Example:

input: ( arg1, arg2, arg3, (arg4), '(pip' )

output: arg1, arg2, arg3, (arg4), '(pip'

note that the '(pip' is correctly managed as string. (tried in regulator: http://sourceforge.net/projects/regulator/)

Upvotes: 6

Joy Hu
Joy Hu

Reputation: 556

The regular expression using Ruby (version 1.9.3 or above):

/(?<match>\((?:\g<match>|[^()]++)*\))/

Demo on rubular

Upvotes: 5

Tomalak
Tomalak

Reputation: 338416

(?<=\().*(?=\))

If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).

This regex just returns the text between the first opening and the last closing parentheses in your string.


(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not a commonly available.

Upvotes: 28

Alexander Bartosh
Alexander Bartosh

Reputation: 8857

It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.

You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.

Angle brackets <> were used because they do not require escaping.

The regular expression looks like this:

<
[^<>]*
(
    (
        (?<Open><)
        [^<>]*
    )+
    (
        (?<Close-Open>>)
        [^<>]*
    )+
)*
(?(Open)(?!))
>

Upvotes: 15

rogal111
rogal111

Reputation: 5933

You can use regex recursion:

\(([^()]|(?R))*\)

Upvotes: 145

Zach Scrivena
Zach Scrivena

Reputation: 29569

[^\(]*(\(.*\))[^\)]*

[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.

Upvotes: 35

Douglas Leeder
Douglas Leeder

Reputation: 53285

The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.

If you need to match matching nested brackets, then you need something more than regular expressions. - see @dehmann

If it's just first open to last close see @Zach

Decide what you want to happen with:

abc ( 123 ( foobar ) def ) xyz ) ghij

You need to decide what your code needs to match in this case.

Upvotes: 1

Related Questions