Reputation: 533
I have some text written in a custom Markdown-style format. For example:
[Lorem ipsum]
Dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
[Ut wisi]
[Enim ad minim veniam](a), quis nostrud exerci tation ullamcorper.
suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat.
Vel illum dolore eu feugiat nulla facilisis at vero.
[Ros et accumsan et iusto odio dignissim](b) qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.
[[Nam liber]](c)
Tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.
As you can see, there are square brackets surrounding headings, and there are square brackets followed by parentheses containing a letter, which is what I am trying to match with a regex. The regex I'm trying to use is this:
preg_match_all("#\[(.*?)\]\(([a-z]+)\)#is",$html,$matches)
The problem with this one is that it matches from [Lorem ipsum] all the way down to the end of (a).
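A cut-down reproduction of that behaviour, as a sketch (the shortened sample text and the `$sample` name are made up for illustration). The lazy `.*?` only decides where the match ends; the match still begins at the first `[` the engine finds and then grows until a `]` directly followed by `(letters)` turns up:

```php
<?php
// Shortened, hypothetical sample in the same custom format.
$sample = "[Lorem ipsum]\nDolor sit amet.\n[Ut wisi]\n[Enim ad minim veniam](a), quis nostrud.";

// The pattern from above, unchanged apart from the single quotes.
preg_match_all('#\[(.*?)\]\(([a-z]+)\)#is', $sample, $matches);

// A single match that starts at the first heading and ends at "(a)".
var_dump($matches[0]);
```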
I could also use the following, however I need to be able to include headings with their square brackets so this doesn't work correctly:
preg_match_all("#\[([^]]+)\]\(([a-z]+)\)#is",$html,$matches)
After some reading up, I suspect what I need is a lookahead, however I've not been able to get my head around them. Any help much appreciated.
Clarification
I'm basically looking to be able to wrap any part of some text with the square brackets/parentheses combination and then match those parts with a regex, without existing square brackets elsewhere in the text causing conflicts. Example text:
[[Lorem ipsum]](a)
Dolor sit amet, [consectetuer adipiscing elit](b), sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
Desired matches:
[[Lorem ipsum]](a)
[consectetuer adipiscing elit](b)
Or... more complex
[[Lorem ipsum]
Dolor sit amet, sed diam nonummy nibh euismod](a) tincidunt ut laoreet dolore magna aliquam erat volutpat.
Desired match:
[[Lorem ipsum]
Dolor sit amet, sed diam nonummy nibh euismod](a)
Is it possible?
Upvotes: 1
Views: 1589
Reputation: 34395
m.buettner's answer is excellent. It is both accurate and well documented (it got my up-vote and deserves to remain the selected answer). I really like the fact that the regex is self-documenting in free-spacing mode. However, for the sake of completeness (and as a demonstration of another commenting style), here is an equivalent (but slightly more efficient) regex solution:
preg_match_all('/
# Match a "[...[...]...[...]...](...)" structure.
\[ # Literal open square bracket.
( # $1: Square bracket contents.
[^[\]]* # {normal*} Zero or more non-[].
(?: # Begin {(special normal*)*}.
\[[^[\]]*\] # {special} Nested matching [] pair.
[^[\]]* # More {normal*} Zero or more non-[].
)* # End {(special normal*)*}.
) # $1: Square bracket contents.
\] # Literal close square bracket.
(?: # Optional matching parentheses.
\( # Literal open parentheses.
([A-Za-z]+) # $2: Parentheses contents.
\) # Literal close parentheses.
)? # Optional matching parentheses.
/x',
$input,
$matches);
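A minimal usage sketch, assuming shortened versions of the two sample texts from the question's clarification (the names `$re`, `$simple`, `$nested`, `$m1` and `$m2` are illustrative). It uses the same pattern collapsed onto one line, so the `x` modifier is not needed:

```php
<?php
// Same pattern as above, written without free-spacing mode.
$re = '/\[([^[\]]*(?:\[[^[\]]*\][^[\]]*)*)\](?:\(([A-Za-z]+)\))?/';

// Sample texts based on the question's clarification (shortened).
$simple = "[[Lorem ipsum]](a)\nDolor sit amet, [consectetuer adipiscing elit](b), sed diam.";
$nested = "[[Lorem ipsum]\nDolor sit amet, sed diam nonummy nibh euismod](a) tincidunt.";

preg_match_all($re, $simple, $m1);
preg_match_all($re, $nested, $m2);

print_r($m1[0]); // full matches
print_r($m1[2]); // letters found in the parentheses
print_r($m2[0]); // heading plus the following text, up to "(a)"
```

Note that the pattern handles a single level of nested square brackets, which is all the question's examples need.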
Improvements (mostly cosmetic/stylistic):
The regex is enclosed in 'single quotes' rather than "double quotes". With PHP, there is an extra level of interpretation with double-quoted strings and there are many more character escape sequences to be dealt with (the "$" character in particular can cause mischief). Bottom line: with PHP, it's best to enclose regex patterns within single-quoted strings (i.e. less backslash soup).
The [nested [square bracket] structure] logic was re-arranged to implement Friedl's "Unrolling-the-Loop" efficiency technique. This results in less backtracking for the case where the outer square brackets have no nested square brackets.
The s single-line modifier is removed (no need: there are no dots!).
The i ignore-case modifier is removed and the affected character class [a-z] was changed to [A-Za-z] to compensate. (Some regex engines run a wee bit faster when in case-sensitive mode.)
The "]" closing square bracket metacharacter is explicitly escaped, i.e. to "\]" (although this is not required, it is good practice IMHO).
The $2 capture group is consolidated onto one line.
Upvotes: 1
Reputation: 44279
Here you go.
preg_match_all("~
\[( # open outer square brackets and capturing group
(?: # open subpattern for optional inner square brackets
[^[\]]* # non-square-bracket characters
\[ # open inner square bracket
[^[\]]* # non-square-bracket characters
] # close inner square bracket
)* # end subpattern and repeat it 0 or more times
[^[\]]* # non-square-bracket characters
)] # end capturing group and outer square brackets
(?: # open subpattern for optional parentheses
\(( # open parentheses and capturing group
[a-z]+ # letters
)\) # close capturing group and parentheses
)? # end subpattern and make it optional
~isx",
$input,
$matches);
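Because the parentheses are optional, plain headings such as `[Ut wisi]` match as well, with an empty second group. If only the bracket-plus-letter pairs are wanted, one option is to filter on that group afterwards. A rough sketch, assuming a shortened version of the question's sample (the `$input` text and the `$linked` array are illustrative, and the pattern is written compactly in single quotes):

```php
<?php
$input = "[Lorem ipsum]\nDolor sit amet.\n[Ut wisi]\n"
       . "[Enim ad minim veniam](a), quis nostrud.\n[[Nam liber]](c)\nTempor cum soluta.";

preg_match_all('~\[((?:[^[\]]*\[[^[\]]*])*[^[\]]*)](?:\(([a-z]+)\))?~is', $input, $matches);

// With the default PREG_PATTERN_ORDER, an optional group that did not
// take part in a match shows up as an empty string.
$linked = [];
foreach ($matches[0] as $i => $whole) {
    if ($matches[2][$i] !== '') {
        $linked[$matches[2][$i]] = $matches[1][$i]; // letter => bracket contents
    }
}

print_r($linked);
```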
And the regex in one line:
"~\[((?:[^[\]]*\[[^[\]]*])*[^[\]]*)](?:\(([a-z]+)\))?~isx"
Upvotes: 4
Reputation: 30283
I think you just need to tweak your first regex a tiny bit:
preg_match_all("#\[(.*?)\](?:\(([a-z]+)\))?#is",$html,$matches)
The change is the added (?: ... )? wrapped around the parentheses part. This way, the parenthesized letters are optional.
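A quick check of what this tweak does and does not cover, as a sketch on a shortened sample (the sample string is put together from the question's text). Headings and the `(a)` link come out as separate matches, but a nested pair such as `[[Nam liber]](c)` is cut at the first `]`, so its letter is lost:

```php
<?php
$html = "[Ut wisi]\n[Enim ad minim veniam](a), quis nostrud.\n[[Nam liber]](c)";

preg_match_all('#\[(.*?)\](?:\(([a-z]+)\))?#is', $html, $matches);

print_r($matches[0]); // full matches
print_r($matches[2]); // letters; empty string where no parentheses matched
```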
EDIT:
Given the clarifications, here's a new regex:
\[((?:[^][]|\[[^][]*?\])*?)\](?:\(([a-z]+)\))?
Here is a Rubular demo.
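The same pattern can also be tried from PHP. A sketch, assuming `#` as the delimiter, the `i` modifier from the question, and capture group 1 closed just before the final `\]` (the sample text is a shortened version of the one in the question's clarification):

```php
<?php
// Assumption: group 1 is closed right before the outer closing bracket.
$re = '#\[((?:[^][]|\[[^][]*?\])*?)\](?:\(([a-z]+)\))?#i';

$text = "[[Lorem ipsum]](a)\nDolor sit amet, [consectetuer adipiscing elit](b), sed diam.";

preg_match_all($re, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // With PREG_SET_ORDER, $m[2] is absent when the parentheses did not match.
    printf("%s => %s\n", $m[1], $m[2] ?? '(none)');
}
```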
Upvotes: 0