Adam

Reputation: 533

Regex matching square brackets followed by parenthesis where the square brackets can also contain other square brackets

I have some text written in a custom Markdown-style format. For example:

[Lorem ipsum] 
Dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. 

[Ut wisi] 
[Enim ad minim veniam](a), quis nostrud exerci tation ullamcorper. 
suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat. 
Vel illum dolore eu feugiat nulla facilisis at vero.
[Ros et accumsan et iusto odio dignissim](b) qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. 

[[Nam liber]](c)
Tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.

As you can see, there are square brackets surrounding headings, and there are square brackets followed by parentheses containing a letter, which is what I am trying to match with a regex. The regex I'm trying to use is this:

preg_match_all("#\[(.*?)\]\(([a-z]+)\)#is",$html,$matches)

The problem with this ^ one is it matches from [Lorem ipsum] down to the end of (a).

I could also use the following, however I need to be able to include headings with their square brackets so this doesn't work correctly:

preg_match_all("#\[([^]]+)\]\(([a-z]+)\)#is",$html,$matches)

After some reading up, I suspect what I need is a lookahead, however I've not been able to get my head around them. Any help much appreciated.


Clarification

I'm basically looking to be able to wrap any part of some text with the square brackets/parenthesis combination and then be able to match them with regex without existing square brackets anywhere causing conflicts. Example text:

[[Lorem ipsum]](a)
Dolor sit amet, [consectetuer adipiscing elit](b), sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. 

Desired matches:

[[Lorem ipsum]](a)
[consectetuer adipiscing elit](b)

Or... more complex

[[Lorem ipsum]
Dolor sit amet, sed diam nonummy nibh euismod](a) tincidunt ut laoreet dolore magna aliquam erat volutpat. 

Desired match:

[[Lorem ipsum]
Dolor sit amet, sed diam nonummy nibh euismod](a)

Is it possible?

Upvotes: 1

Views: 1589

Answers (3)

ridgerunner

Reputation: 34395

m.buettner's answer is excellent. It is both accurate and well documented (it got my up-vote and deserves to remain the selected answer). I really like the fact that the regex is self documented in free-spacing mode. However, for the sake of completeness, (and as a demonstration of another commenting style) here is an equivalent (but slightly more efficient) regex solution:

preg_match_all('/
    # Match a "[...[...]...[...]...](...)" structure.
    \[               # Literal open square bracket.
    (                # $1: Square bracket contents.
      [^[\]]*        # {normal*} Zero or more non-[].
      (?:            # Begin {(special normal*)*}.
        \[[^[\]]*\]  # {special} Nested matching [] pair.
        [^[\]]*      # More {normal*} Zero or more non-[].
      )*             # End {(special normal*)*}.
    )                # $1: Square bracket contents.
    \]               # Literal close square bracket.
    (?:              # Optional matching parentheses.
      \(             # Literal open parentheses.
      ([A-Za-z]+)    # $2: Parentheses contents.
      \)             # Literal close parentheses.
    )?               # Optional matching parentheses.
    /x',
    $input,
    $matches);

Improvements (mostly cosmetic/stylistic):

  • The regex is enclosed within 'single quotes' rather than "double quotes". With PHP, there is an extra level of interpretation with double quoted strings and there are many more character escape sequences to be dealt with (the "$" character in particular can cause mischief). Bottom line: with PHP, it's best to enclose regex patterns within single quoted strings (i.e. less backslash soup).
  • The expression logic which matches the [nested [square bracket] structure] was re-arranged to implement Friedl's "Unrolling-the-Loop" efficiency technique. This results in less backtracking for the case where the outer square brackets have no nested square brackets.
  • The capture groups' open and close parentheses (which span more than one line) are indented to the same level (i.e. are vertically aligned) to make them easier to pair up visually.
  • The capture group number is included in the comments on the lines with the open and close parentheses.
  • The s single line modifier is removed (no need - there are no dots!).
  • The i ignore case modifier is removed and the affected character class [a-z] was changed to [A-Za-z] to compensate. (Some regex engines run a wee bit faster when in case sensitive mode.)
  • The literal "]" closing square bracket metacharacter is explicitly escaped, i.e. to: "\]". (although this is not required, it is good practice IMHO).
  • Capture group $2 is consolidated onto one line.
  • A full width header comment is added at the top of the regex describing the overall regex purpose.
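
As a quick check, here is a minimal sketch that runs a compact, single-line form of the pattern above (comments and whitespace stripped) against the first example from the question's clarification. The variable names `$pattern` and `$input` are just illustrative, and the sample text is shortened:

```php
<?php
// Compact form of the free-spacing pattern above (same logic, no comments).
$pattern = '/\[([^[\]]*(?:\[[^[\]]*\][^[\]]*)*)\](?:\(([A-Za-z]+)\))?/';

// Shortened sample text based on the question's clarification.
$input = "[[Lorem ipsum]](a)\n"
       . "Dolor sit amet, [consectetuer adipiscing elit](b), sed diam nonummy.";

preg_match_all($pattern, $input, $matches);

// $matches[0]: full matches, $matches[1]: bracket contents,
// $matches[2]: letters inside the parentheses.
print_r($matches[0]);
```

With this input, `$matches[0]` should contain `[[Lorem ipsum]](a)` and `[consectetuer adipiscing elit](b)`. Note that the pattern handles one level of nested brackets only; deeper nesting would need a recursive pattern.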

Upvotes: 1

Martin Ender

Reputation: 44279

Here you go.

preg_match_all("~
    \[(              # open outer square brackets and capturing group
    (?:              # open subpattern for optional inner square brackets
        [^[\]]*      # non-square-bracket characters
        \[           # open inner square bracket
        [^[\]]*      # non-square-bracket characters
        ]            # close inner square bracket
    )*               # end subpattern and repeat it 0 or more times
    [^[\]]*          # non-square-bracket characters
    )]               # end capturing group and outer square brackets
    (?:              # open subpattern for optional parentheses
        \((          # open parentheses and capturing group
        [a-z]+       # letters
        )\)          # close capturing group and parentheses
    )?               # end subpattern and make it optional
    ~isx",
    $input,
    $matches);

And the regex in one line:

"~\[((?:[^[\]]*\[[^[\]]*])*[^[\]]*)](?:\(([a-z]+)\))?~isx"
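
Because the parenthesized part is optional, this pattern matches plain headings as well as bracket/parenthesis pairs. A minimal sketch of how the two could be told apart, assuming a shortened version of the question's sample text (the `$headings` and `$links` arrays are hypothetical names, not part of the original code):

```php
<?php
// Same one-line pattern as above, in a single-quoted string.
$pattern = '~\[((?:[^[\]]*\[[^[\]]*])*[^[\]]*)](?:\(([a-z]+)\))?~isx';

// Shortened sample text based on the question.
$input = "[Lorem ipsum]\nDolor sit amet.\n\n"
       . "[Ut wisi]\n[Enim ad minim veniam](a), quis nostrud.\n\n"
       . "[[Nam liber]](c)\nTempor cum soluta.";

$headings = []; // hypothetical: bracketed text with no "(letters)" suffix
$links    = []; // hypothetical: letter => bracketed text

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // Group 2 is absent or empty when no "(letters)" part followed the brackets.
    $key = $m[2] ?? '';
    if ($key === '') {
        $headings[] = $m[1];
    } else {
        $links[$key] = $m[1];
    }
}

print_r($headings);
print_r($links);
```

Here `PREG_SET_ORDER` groups the captures per match, which makes it straightforward to check whether the second group took part in each match.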

Working demo

Upvotes: 4

Andrew Cheong

Reputation: 30283

I think you just need to tweak your first regex a tiny bit:

preg_match_all("#\[(.*?)\](?:\(([a-z]+)\))?#is",$html,$matches)
                          ^^^            ^^

This way, the parenthesized letters are optional.

EDIT:

Given the clarifications, here's a new regex:

\[((?:[^][]|\[[^][]*?\])*?)\](?:\(([a-z]+)\))?

Here is a Rubular demo.
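
For PHP, a minimal sketch of the same idea, assuming the first capture group is closed just before the final `\]` so that the parentheses balance. It is run here against a shortened form of the "more complex" multi-line example from the question; `$pattern` and `$input` are illustrative names:

```php
<?php
// Assumption: group 1 is closed right before the final \] so the pattern compiles.
$pattern = '~\[((?:[^][]|\[[^][]*?\])*?)\](?:\(([a-z]+)\))?~i';

// Shortened form of the multi-line example from the question.
$input = "[[Lorem ipsum]\n"
       . "Dolor sit amet, sed diam nonummy nibh euismod](a) tincidunt ut laoreet.";

preg_match_all($pattern, $input, $matches);

// $matches[0]: full matches, $matches[2]: letters inside the parentheses.
print_r($matches[0]);
```

The negated class `[^][]` also matches newlines, so no `s` modifier is needed for the match to span both lines.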

Upvotes: 0
