fesja

Reputation: 3313

Regex to match expression with multiple parentheses, one within each other

I'm building a task (in PHP) that reads all the files of my project in search of i18n messages. I want to detect messages like these:

// Basic example
__('Show in English')  => Show in English
// Get the message and the name of the i18n file 
__("Show in English", array(), 'page') => Show in English, page
// Be careful of quotes
__("View Mary's Car", array()) => View Mary's Car
// Be careful of strings after the __() expression
__('at').' '.function($param) => at

The regular expression that works for those cases (some other cases are also taken into account) is:

__\(.*?['|\"](.*?)(?:['|\"][\.|,|\)])(?: *?array\(.*?\),.*?['|\"](.*?)['|\"]\)[^\)])?

However, if the expression spans multiple lines, it doesn't work. I have to add the dotall modifier /s, but that breaks the previous regular expression, because it no longer controls where to stop matching:

// Detect with multiple lines
echo __('title_in_place', array(
    '%title%' => $place['title']
  ), 'welcome-user'); ?>    

One thing would solve the problem and simplify the regular expression: matching open and close parentheses. No matter what's inside __() or how many parentheses there are, it would "count" the number of opening parentheses and expect the same number of closing ones.

Is it possible? How? Thanks a lot!

Upvotes: 1

Views: 4392

Answers (4)

Mirocow

Reputation: 335

For me, an expression like this works:

(\(([^()]+)\))

I tried it on these inputs:

 * 1) (1+2)
 * 2) (1+2)+(3+2)
 * 3) (IF 1 THEN 1 ELSE 0) > (IF 2 THEN 1 ELSE 1)
 * 4) (1+2) -(4+ (3+2))
 * 5) (1+2) -((4+ (3+2)-(6-7)))
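As a rough check of what this expression captures, the sketch below runs it over input 4 from the list. The variable names are only for illustration. Because [^()]+ cannot cross a parenthesis, the pattern only finds groups that contain no further parentheses, so the outer "(4+ (3+2))" is not returned as a whole.

```php
<?php
// Pattern from the answer: a parenthesised group with no parentheses inside it.
$pattern = '/(\(([^()]+)\))/';

$input = '(1+2) -(4+ (3+2))';
preg_match_all($pattern, $input, $matches);

// $matches[1] holds each group including its parentheses,
// $matches[2] holds only the contents.
print_r($matches[1]); // (1+2) and (3+2); the enclosing (4+ ...) is skipped
print_r($matches[2]); // 1+2 and 3+2
```

To resolve deeper nesting with this pattern alone, the matching would presumably have to be repeated from the inside out.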

Upvotes: 0

ridgerunner

Reputation: 34385

Yes. First, here is the classic example for simple nested brackets (parentheses):

\(([^()]|(?R))*\)

or faster versions which use a possessive quantifier:

\(([^()]++|(?R))*\)

or (equivalent) atomic grouping:

\((?>[^()]+|(?R))*\)
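A minimal sketch of how the recursive pattern behaves in PHP's preg functions, using the atomic-group variant and one of the nested inputs seen elsewhere in this thread. The variable names are illustrative.

```php
<?php
// Atomic-group variant of the recursive pattern: (?R) re-enters the
// whole expression each time a nested "(" is found.
$pattern = '/\((?>[^()]+|(?R))*\)/';

$input = '(1+2) -((4+ (3+2)-(6-7)))';
preg_match_all($pattern, $input, $matches);

// Each top-level balanced group is returned as one match,
// including everything nested inside it.
print_r($matches[0]); // (1+2) and ((4+ (3+2)-(6-7)))
```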

But you can't use the (?R) "match whole expression" construct here, because the outermost brackets are special (they are preceded by two underscores). Here is a tested script which matches what (I think) you want:

Solution: Use group $1 (recursive) subroutine call: (?1)

<?php // test.php Rev:20120625_2200
$re_message = '/
    # match __(...(...)...) message lines (having arbitrary nesting depth).
    __\(                     # Outermost opening bracket (with leading __().
    (                        # Group $1: Bracket contents (subroutine).
      (?:                    # Group of bracket contents alternatives.
        [^()"\']++           # Either one or more non-brackets, non-quotes,
      | "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*"      # or a double quoted string,
      | \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\'  # or a single quoted string,
      | \( (?1) \)          # or a nested bracket (repeat group 1 here!).
      )*                    # Zero or more bracket contents alternatives.
    )                       # End $1: recursed subroutine.
    \)                      # Outermost closing bracket.
    .*                      # Match remainder of line following __()
    /mx';
$data = file_get_contents('testdata.txt');
$count = preg_match_all($re_message, $data, $matches);
printf("There were %d __(...) messages found.\n", $count);
for ($i = 0; $i < $count; ++$i) {
    printf("  message[%d]: %s\n", $i + 1, $matches[0][$i]);
}
?>

Note that this solution handles balanced parentheses (inside the "__(...)" construct) to any arbitrary depth (limited only by host memory). It also correctly handles quoted strings inside the "__(...)" and ignores any parentheses that may appear inside these quoted strings. Good luck.
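To try the same idea without a testdata.txt file, here is a hedged variation that feeds the multi-line example from the question through a trimmed copy of the pattern. It assumes a nowdoc for the regex (so backslashes need no doubling for PHP) and drops the trailing .* and the /m flag, so only the __(...) call itself is matched. Group 1 then holds the raw argument list.

```php
<?php
// Same structure as the pattern above, written in a nowdoc so that
// "\\" in the text is exactly the regex escape for one backslash.
$re = <<<'RE'
/__\(
  (                                   # Group 1: text between the outer brackets
    (?:
      [^()"']++                       # non-bracket, non-quote characters
    | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # double-quoted string
    | '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # single-quoted string
    | \( (?1) \)                      # nested brackets: recurse into group 1
    )*
  )
\)/x
RE;

// Sample input taken from the question (multi-line call plus a short one).
$data = <<<'TXT'
echo __('title_in_place', array(
    '%title%' => $place['title']
  ), 'welcome-user');
__('at').' '.function($param)
TXT;

$count = preg_match_all($re, $data, $m);
printf("Found %d calls\n", $count);
foreach ($m[1] as $i => $args) {
    printf("args[%d]: %s\n", $i + 1, $args);
}
```

The message and the i18n file name still have to be pulled out of group 1 in a second step; that part is not covered by this sketch.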

Upvotes: 1

Steve Wortham

Reputation: 22220

The only way I'm aware of pulling this off is with balanced group definitions. That's a feature in the .NET flavor of regular expressions, and is explained very well in this article.

And as Qtax noted, this can be done in PCRE with (?R) as described in their documentation.

Alternatively, this could be accomplished by writing a custom parser. The basic idea is to maintain a variable called ParenthesesCount as you parse from left to right: increment ParenthesesCount every time you see ( and decrement it for every ). I recently wrote a parser that handles nested parentheses this way.
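The counting idea might look roughly like this in PHP. The function name findClosingParen is hypothetical, $depth plays the role of ParenthesesCount, and the nullable return type assumes PHP 7.1 or later. This simple version does not treat parentheses inside quoted strings specially.

```php
<?php
/**
 * Return the index of the ")" that closes the "(" at $openPos,
 * or null if the parentheses never balance.
 * Parentheses inside quoted strings are NOT handled specially here.
 */
function findClosingParen(string $text, int $openPos): ?int
{
    $depth = 0;                      // the "ParenthesesCount" from the answer
    $length = strlen($text);
    for ($i = $openPos; $i < $length; $i++) {
        if ($text[$i] === '(') {
            $depth++;
        } elseif ($text[$i] === ')') {
            $depth--;
            if ($depth === 0) {
                return $i;           // back at the starting level: matching ")"
            }
        }
    }
    return null;                     // ran out of input before balancing
}

$text = "__('title', array('%a%' => f(1)), 'page'); more()";
$close = findClosingParen($text, 2); // index 2 is the "(" right after "__"
echo substr($text, 2, $close - 2 + 1), "\n";
```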

Upvotes: 0

Mark Byers

Reputation: 838116

Matching balanced parentheses is not possible with regular expressions (unless you use an engine with non-standard non-regular extensions, but even then it's still a bad idea and will be hard to maintain).

You could use a regular expression to find lines containing potential matches, then iterate over the string character by character counting the number of open and close parentheses until you find the index of the matching closing parenthesis.
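A sketch of that two-step approach, under some assumptions: the helper name extractI18nCalls is made up for illustration, candidates are located with a plain /__\(/ search, and quoted strings are skipped so that a ")" inside a message does not end the call early.

```php
<?php
/**
 * Hypothetical helper: find every __( ... ) call in $source and return
 * the text between the outer parentheses of each call.
 */
function extractI18nCalls(string $source): array
{
    $calls = [];
    // Step 1: a simple regex locates each candidate "__(".
    preg_match_all('/__\(/', $source, $found, PREG_OFFSET_CAPTURE);
    $length = strlen($source);

    foreach ($found[0] as $hit) {
        $start = $hit[1] + 3;   // first character after "__("
        $depth = 1;             // we are already inside one "("
        $quote = null;          // current quote character, or null outside a string

        // Step 2: scan character by character, counting parentheses.
        for ($i = $start; $i < $length; $i++) {
            $ch = $source[$i];
            if ($quote !== null) {
                if ($ch === '\\') {
                    $i++;               // skip the escaped character
                } elseif ($ch === $quote) {
                    $quote = null;      // end of the quoted string
                }
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;
            } elseif ($ch === '(') {
                $depth++;
            } elseif ($ch === ')' && --$depth === 0) {
                $calls[] = substr($source, $start, $i - $start);
                break;
            }
        }
    }
    return $calls;
}

$source = <<<'SRC'
echo __('Smile :)', array(
    '%n%' => count($items)
  ), 'page');
echo __("View Mary's Car", array());
SRC;

$calls = extractI18nCalls($source);
print_r($calls);
```

Comments and heredocs in the scanned PHP source are not accounted for here, so this is closer to a starting point than a complete extractor.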

Upvotes: 1
