Enrico Massone

Reputation: 7348

Regex negative look ahead to match markdown links

We are stuck over a regex issue.

Here is the problem. Consider the following two patterns:

1) [hello] [world]

2) [hello [world]]

We need a regex that matches only [world] in the first pattern and the entire pattern ([hello [world]]) in the second.

Using a negative lookahead, I wrote the following regex, which solves part of the problem:

\[[^\[\]]+\](?!.*\[[^\[\]]+\])

This regex matches pattern 1) as we want, but does not work for pattern 2).
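
To make the failure concrete, the regex above can be run against both inputs. This is a minimal sketch in Python rather than .NET (the pattern uses no engine-specific syntax), and the helper name last_flat_group is made up for illustration.

```python
import re

# The regex from the question, unchanged.
LAST_FLAT_GROUP = re.compile(r"\[[^\[\]]+\](?!.*\[[^\[\]]+\])")


def last_flat_group(text):
    """Return the text matched by the question's regex, or None."""
    m = LAST_FLAT_GROUP.search(text)
    return m.group(0) if m else None


print(last_flat_group("[hello] [world]"))   # pattern 1)
print(last_flat_group("[hello [world]]"))   # pattern 2)
```

Both calls yield [world]: because [^\[\]]+ cannot cross a bracket, the pattern can never span the nested [hello [world]].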

Upvotes: 5

Views: 409

Answers (3)

Casimir et Hippolyte

Reputation: 89614

A simpler way to find the last balanced square-bracket part of a string with the .NET regex engine is to search the string from right to left using the RegexOptions.RightToLeft option. This way you avoid:

  • searching the whole string for nothing
  • checking the rest of the string with a lookahead, since the pattern returns the first match from the right.

code:

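using System;
using System.Text.RegularExpressions;

// Sample input with nested brackets and an escaped bracket (\]).
string input = @"[hello] [world] [hello [world\]] ]";
// Balancing-group pattern, written to be matched from right to left.
string rtlPattern = @"(?(c)(?!))\[(?>\\.|(?<!\\)[^][]+|(?<-c>)\[|(?<c>)])*]";

// RightToLeft starts the search at the end of the input.
Match m = Regex.Match(input, rtlPattern, RegexOptions.RightToLeft);

if (m.Success)
    Console.WriteLine("Result: {0}", m.Groups[0].Value);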

demo

Note that to understand what happens, you also need to read the parts of the pattern from right to left. Details:

]  # a literal closing square bracket

(?> # open an atomic group (*)
    \\.         # any escaped character with a backslash
  |
    [^][]+  # all that isn't a square bracket
    (?<!\\) # not preceded by a backslash
  |
    (?<-c>) \[  # decrement the c stack for an opening bracket
  |
    (?<c>)   ]  # increment the c stack for a closing bracket
)* # repeat zero or more times

\[  # a literal square opening bracket

(?(c) # conditional statement: true if c isn't empty
    (?!) # always failing pattern: "not followed by nothing"
)

(*) Note that using an atomic group is mandatory here to avoid possible catastrophic backtracking, since the group contains an item with a + quantifier and is itself repeated. You can learn more about this problem here.

This pattern already deals with escaped nested brackets, and you can also add the RegexOptions.Singleline option if you want to match a part that includes the newline character.
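
The right-to-left idea can also be sketched without a regex. The following Python snippet is only an illustration, not a port of the .NET pattern: it walks the string from the end and keeps a depth counter that plays the role of the c stack. It assumes a bracket is escaped when an odd number of backslashes precedes it, which may differ from the pattern in edge cases. The names is_escaped and last_balanced_rtl are hypothetical.

```python
def is_escaped(s, i):
    """Assumption: s[i] is escaped if an odd number of backslashes precedes it."""
    count = 0
    j = i - 1
    while j >= 0 and s[j] == "\\":
        count += 1
        j -= 1
    return count % 2 == 1


def last_balanced_rtl(s):
    """Return the rightmost balanced [...] part of s, scanning right to left."""
    # Try every unescaped "]" as the end of the match, rightmost first.
    for end in range(len(s) - 1, -1, -1):
        if s[end] != "]" or is_escaped(s, end):
            continue
        depth = 0  # plays the role of the "c" stack in the pattern
        for i in range(end, -1, -1):
            if s[i] not in "[]" or is_escaped(s, i):
                continue
            # Reading leftwards, "]" opens a level and "[" closes one.
            depth += 1 if s[i] == "]" else -1
            if depth == 0:
                return s[i:end + 1]
    return None


print(last_balanced_rtl(r"[hello] [world] [hello [world\]] ]"))
```

On the sample input above, this sketch returns [hello [world\]] ].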

Upvotes: 0

Davide Icardi

Reputation: 12219

Here is another possible solution that matches all markdown links, provided the brackets are "correctly" escaped.

Here is the regex:

\[(?<text>(?:[^\[\]]|\\\[|\\\])+?)\]\((?<link>.+?)\)

See regex 101 demo.

Note that this does not support unescaped brackets inside the link text:

[link number \[2]](http://myurl.com)
[link number [2\]](http://myurl.com)

It may also not support other edge cases...
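
As a quick check of the pattern above, here is a minimal sketch in Python rather than .NET. The only change is Python's (?P<name>...) spelling for named groups. The helper name find_links is made up for illustration.

```python
import re

# Same pattern as above; Python spells named groups (?P<name>...).
LINK = re.compile(r"\[(?P<text>(?:[^\[\]]|\\\[|\\\])+?)\]\((?P<link>.+?)\)")


def find_links(markdown):
    """Return a (text, link) pair for every link the pattern finds."""
    return [(m.group("text"), m.group("link")) for m in LINK.finditer(markdown)]


# Both brackets escaped: the whole link is found.
print(find_links(r"[link number \[2\]](http://myurl.com)"))
# Closing bracket left unescaped: nothing is found.
print(find_links(r"[link number \[2]](http://myurl.com)"))
```

With both brackets escaped, the text group keeps the backslashes (link number \[2\]), so they still have to be removed afterwards if the plain text is needed.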

Upvotes: 0

Wiktor Stribiżew

Reputation: 627292

In .NET regex, you may use balancing groups to match nested balanced brackets. So, to match the last [...] substring (with nested brackets) on a line, you need quite a long pattern like

\[(?:[^][]+|(?<c>)\[|(?<-c>)])*(?(c)(?!))](?!.*\[(?:[^][]+|(?<d>)\[|(?<-d>)])*(?(d)(?!))])

See the regex demo at RegexStorm.net.

Details

  • \[(?:[^][]+|(?<c>)\[|(?<-c>)])*(?(c)(?!))] - a [...] substring with nested brackets:
    • \[ - a [ char
    • (?:[^][]+|(?<c>)\[|(?<-c>)])* - zero or more occurrences of:
      • [^][]+| - 1 or more chars other than ] and [ or
      • (?<c>)\[| - an empty value is pushed onto the Group "c" stack and a [ is matched, or
      • (?<-c>)] - an empty value is popped from the Group "c" stack and a ] is matched
    • (?(c)(?!)) - a conditional that fails the match if Group "c" stack is not empty
    • ] - a ] char
  • (?!.*\[(?:[^][]+|(?<d>)\[|(?<-d>)])*(?(d)(?!))]) - a negative lookahead that fails the match if, after any 0+ chars other than newline chars, there is another substring matching the same pattern as the one above (counted with Group "d").
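
The push and pop bookkeeping described above can be sketched with an explicit stack. This Python snippet is an illustration only, not an equivalent of the .NET pattern: it assumes a single line of input with no escape handling, and the name last_balanced_group is hypothetical.

```python
def last_balanced_group(line):
    """Return the last outermost balanced [...] substring of line, or None."""
    open_positions = []  # like Group "c": one entry per "[" still open
    spans = []           # outermost balanced (start, end) spans found so far
    for i, ch in enumerate(line):
        if ch == "[":
            open_positions.append(i)          # push
        elif ch == "]" and open_positions:
            start = open_positions.pop()      # pop the matching "["
            # Drop spans nested inside the one that was just closed.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, i + 1))
    if not spans:
        return None
    start, end = spans[-1]
    return line[start:end]


print(last_balanced_group("[hello] [world]"))
print(last_balanced_group("[hello [world]]"))
```

This yields [world] for the first input and [hello [world]] for the second, which is the behaviour the question asks for.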

Upvotes: 2
