user3038634

Reputation: 23

Regex for anything but multicharacter token

I want to create a regex that returns everything between two multicharacter tokens where the opening token is ;;( and the closing token is ;;), such as

;;(
  Capture this part, which can contain everything except the closing token 
;;)

I thought the regex /;;\((?!;;\));;\)/ using negative lookahead should work but this is returning no matches. Is it possible to use a regex for this?

Upvotes: 2

Views: 171

Answers (2)

Wiktor Stribiżew

Reputation: 626853

The best way to match text between two multicharacter delimiters is a regex written in line with the unroll-the-loop technique.

So, we have ;;( and ;;) delimiters.

The lazy dot matching regex is ;;\((.*?);;\). This pattern is not efficient: the lazy quantifier tests for the closing delimiter before every character it consumes, so the number of steps keeps growing as the block gets longer.

Unrolling it as ;;\(([^;]*(?:;(?!;\))[^;]*)*);;\) makes matching linear; speed can only suffer if there are many ; characters inside the block.
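As a quick sanity check, here is a minimal Python sketch showing that the two patterns capture the same text. The sample string and the names LAZY and UNROLLED are made up for illustration, and the lazy version is given re.DOTALL so that it can cross line breaks.

```python
import re

# Lazy-dot version: needs DOTALL so that . can match newlines.
LAZY = re.compile(r";;\((.*?);;\)", re.DOTALL)

# Unrolled version: no flags required.
UNROLLED = re.compile(r";;\(([^;]*(?:;(?!;\))[^;]*)*);;\)")

# Hypothetical input with two blocks; the second contains ;; inside it.
text = ";;(\n  first; block\n;;) noise ;;(second;;block;;)"

lazy_hits = LAZY.findall(text)
unrolled_hits = UNROLLED.findall(text)

print(unrolled_hits)
```

Both lists should hold the same two captures, including the stray ; and ;; inside the blocks, since neither is followed by the rest of the closing delimiter.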

It takes timgeb's solution 169 steps to complete the match. It takes mine just 16 steps.

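Also, the unrolled regex does not depend on the /s (DOTALL) modifier, so it can be omitted: the negated character class [^;] matches line breaks, whereas . does not unless DOTALL is set.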
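The difference is easy to see on a multi-line block. This is only a sketch in Python's re module; the input string and variable names are illustrative.

```python
import re

# Hypothetical multi-line block.
text = ";;(\nline one\nline two\n;;)"

# Without DOTALL the dot stops at the first newline, so nothing matches.
lazy_no_flag = re.search(r";;\((.*?);;\)", text)

# With DOTALL the lazy pattern matches across lines.
lazy_dotall = re.search(r";;\((.*?);;\)", text, re.DOTALL)

# The unrolled pattern needs no flag: [^;] already matches newlines.
unrolled = re.search(r";;\(([^;]*(?:;(?!;\))[^;]*)*);;\)", text)

print(lazy_no_flag, repr(unrolled.group(1)))
```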

Why not use lookarounds? Lookarounds are good when you need overlapping matches or there are specific conditions. In this case, you need non-overlapping matches because the leading and trailing delimiters are not equal. Use capturing groups instead, that is, pairs of unescaped parentheses around the subpatterns you need to get. In ;;\(([^;]*(?:;(?!;\))[^;]*)*);;\), we need to get all the text up to the first ;;), i.e. the [^;]*(?:;(?!;\))[^;]*)* part. Thus, we enclose it in ().

What does this unrolled part match?

  • [^;]* - anything but the ; (the first char of the trailing delimiter)
  • (?:;(?!;\))[^;]*)* - zero or more sequences of...
    • ;(?!;\)) - the first char of the trailing delimiter, a literal ; that is not followed by ;) (the rest of the trailing delimiter)
    • [^;]* - zero or more characters other than ; (the first char of the trailing delimiter)
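The breakdown above can also be written into the pattern itself. Below is a sketch using Python's re.VERBOSE flag, which ignores whitespace and # comments inside the pattern; the name UNROLLED and the test string are illustrative only.

```python
import re

UNROLLED = re.compile(r"""
    ;;\(                # opening delimiter ;;(
    (                   # group 1: the block content
        [^;]*           # a run of characters other than ;
        (?:             # zero or more repetitions of:
            ;(?!;\))    #   a ; that does not start the closing ;;)
            [^;]*       #   then another run of characters other than ;
        )*
    )
    ;;\)                # closing delimiter ;;)
""", re.VERBOSE)

# A lone ; inside the block is kept as part of the capture.
captured = UNROLLED.search(";;(a;b;;)").group(1)
print(captured)
```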

Upvotes: 2

timgeb

Reputation: 78690

Use a positive lookbehind and positive lookahead.

(?<=;;\().*?(?=;;\))
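For reference, a minimal sketch of this pattern in Python. The sample text is made up; re.DOTALL is added on the assumption that blocks may span several lines. Python's re module only accepts fixed-width lookbehinds, which ;;\( satisfies.

```python
import re

# Lookbehind and lookahead are zero-width, so the whole match is the content.
LOOKAROUND = re.compile(r"(?<=;;\().*?(?=;;\))", re.DOTALL)

# Hypothetical input with one single-line and one multi-line block.
text = ";;(a;;) ;;(b\nc;;)"

hits = LOOKAROUND.findall(text)
print(hits)
```

Here the delimiters are not part of the match, so no capturing group is needed.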

Demo: https://regex101.com/r/iK5wG4/2

Upvotes: 0
