lucas

Reputation: 75

Regex any characters except some

I'm trying to create a regex that matches [[xyz|asd]] but not [[xyz]], in the following text:

'''Diversity Day'''" is the second episode of the [[The Office (U.S. season 1)]|first season]] of the American [[comedy]] [[television program|television series]] ''[[The Office (U.S. TV series)|The Office]]'', and the show's second episode overall. Written by [[B. J. Novak]] and directed by [[Ken Kwapis]], it first aired in the United States on March 29, 2005, on [[NBC]]. The episode guest stars ''Office'' consulting producer [[Larry Wilmore]] as [[List_of_characters_from_The_Office_(US)#Mr._Brown|Mr. Brown]].

The following results should be captured:

[[The Office (U.S. season 1)]|first season]] <-- note the "]" before "|": in this case "]" is a literal character, not part of a closing "]]"
[[television program|television series]]
[[The Office (U.S. TV series)|The Office]]
[[List_of_characters_from_The_Office_(US)#Mr._Brown|Mr. Brown]]

The regex I was trying to use is:

\[\[([^|]+)\|([^|]+)\]\]

but I can't figure out how to exclude both "|" and "]]" from the groups. [^|(]])] won't work because a character class only excludes the single character "]", not the two-character sequence "]]" (it needs to be excluded as a whole).
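For reference, the overmatching can be reproduced with a minimal Python sketch. Python's re module and the shortened, made-up sample string are assumptions for illustration only; the names attempt, sample and attempt_matches are hypothetical.

```python
import re

# The attempted pattern: each group excludes only "|",
# so group 2 is free to run past a closing "]]".
attempt = re.compile(r"\[\[([^|]+)\|([^|]+)\]\]")

# Made-up sample with the same shape as the wiki text.
sample = "[[a|b]] and [[c]] then [[d|e]]"

attempt_matches = attempt.findall(sample)
print(attempt_matches)
# The first match swallows "[[c]]": [('a', 'b]] and [[c'), ('d', 'e')]
```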

Please help, thanks!

Upvotes: 2

Views: 545

Answers (1)

Wiktor Stribiżew

Reputation: 626825

You may rely on a tempered greedy token here:

\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]

See the regex demo

Details:

  • \[\[ - 2 [ symbols
  • ((?:(?!]]).)*) - Group 1 (note that the * can be made lazy, *?, which helps especially when the first part is usually shorter than the second part) capturing:
    • (?:(?!]]).)* - zero or more sequences of
      • . - any char (except a newline; use the pattern with RegexOptions.Singleline if your strings span multiple lines)...
      • (?!]]) - that does not start a ]] sequence (i.e. the . does not match a ] that is followed by another ])
  • \| - a literal |
  • ((?:(?!]]).)*) - Group 2 capturing the same subpattern as Group 1
  • ]] - 2 literal ] symbols at the end.
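As a rough illustration of the breakdown above, here is a minimal sketch that runs the pattern over an excerpt of the question's text. It assumes Python's re module rather than .NET (the Python counterpart of RegexOptions.Singleline is the re.DOTALL flag); the names tempered and text are hypothetical.

```python
import re

# Tempered greedy token: "." is only allowed where it does not begin a "]]" sequence.
tempered = re.compile(r"\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]")

# Excerpt of the question's text, shortened for the example.
text = (
    "episode of the [[The Office (U.S. season 1)]|first season]] of the "
    "American [[comedy]] [[television program|television series]]"
)

# Each result is a (group 1, group 2) tuple; [[comedy]] has no "|" and is skipped.
for target, label in tempered.findall(text):
    print(repr(target), "->", repr(label))
```

The single "]" before "|" stays in Group 1 because the lookahead only rejects a "]" that is followed by another "]".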

A much more efficient "unrolled" version of this regex is:

\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]

See the regex demo. This regex will treat the first | as the inner field separator. See my other answer about how to unroll tempered greedy tokens.
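To make the "first | as separator" remark concrete, the sketch below compares both patterns on an input with two pipes. It again assumes Python's re module; the test strings and variable names are made up for illustration, and it does not measure the efficiency difference.

```python
import re

tempered = re.compile(r"\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]")
unrolled = re.compile(r"\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]")

# With two pipes, the greedy tempered version splits at the LAST "|",
# while the unrolled version splits at the FIRST one.
two_pipes = "[[a|b|c]]"
print(tempered.findall(two_pipes))  # [('a|b', 'c')]
print(unrolled.findall(two_pipes))  # [('a', 'b|c')]

# The unrolled version still accepts a single "]" before "|" and skips [[comedy]].
tricky = "[[The Office (U.S. season 1)]|first season]] and [[comedy]]"
print(unrolled.findall(tricky))
```

Making the first quantifier of the tempered pattern lazy (*? instead of *) would also split at the first "|".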


Upvotes: 6
