Olivier Melançon
Olivier Melançon

Reputation: 22324

Matching the pattern representing regexp set

This question is not related to variable length look-behind as it probably has a solution without negative look-behind.

In Python3, I am trying to match a pattern that is probably at the limit of what can be achieved with a regexp, but I still want to give it a try. I am actually trying to avoid using a parsing tool.

What I want to match is the pattern that indicate a regexp set. So the following would be matched.

[abc]
[1-9\n\t]
[ \t\]]
[\\\]]
[[\\\\\\\]]

The square brackets cannot be nested, by example in [[]], we want to match [[].

Although, since a \] indicate an escaped bracket, we need to skip those. But a pattern such as \\] must be accepted. The following would not be matched.

[\]
[\\\]
[abc\\\]

The rule ends up being match from [ to the first ] that is not preceded by an odd amount of \.

It seems negative lookbehind does not work because it must have fixed length.

Edit: An interesting solution was given by Wiktor Stribiżew

re.compile(r'\[[^]\\]*(?:\\.[^]\\]*)*\]')

Edit: Simpler version of the above by Rawing

r'\[(?:\\.|[^]\\])*\]'

Upvotes: 2

Views: 75

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627082

You may use

re.compile(r'\[[^]\\]*(?:\\.[^]\\]*)*]', re.DOTALL)

See the regex demo.

Details

  • \[ - a [ char
  • [^]\\]* - 0 or more chars other than ] and \
  • (?: - start of a non-capturing group matching a sequence of:
    • \\. - a \ char followed with any char
    • [^]\\]* - 0 or more chars other than ] and \
  • )* - zero or more repetitions of the patterns inside the non-capturing group
  • ] - a ] char.

The regex follows the unroll-the-loop principle. Depending on the input it may work much, even 10+ times faster than the non-unrolled version, r'\[(?:\\.|[^]\\])*]', that is based on an infinitely quantified alternation group, causing lots of redundant backtracking steps.

Note that the regex above can fail when the initial [ is preceded with a backslash. In those cases, you will need

r'(?<!\\)(?:\\{2})*(\[[^]\\]*(?:\\.[^]\\]*)*])'

See this regex demo

The main difference here is (?<!\\)(?:\\{2})*, a (?<!\\) negative lookbehind that fails the match if the current position is preceded with a \ char, and (?:\\{2})* that matches 0+ repetitions of two literal backslashes. The rest of the pattern is enclosed with capturing parentheses, and when a match is found, you just need to access match.group(1) to get the right value.

Upvotes: 1

Related Questions