Reputation: 22324
This question is not related to variable length look-behind as it probably has a solution without negative look-behind.
In Python3, I am trying to match a pattern that is probably at the limit of what can be achieved with a regexp, but I still want to give it a try. I am actually trying to avoid using a parsing tool.
What I want to match is the pattern that indicate a regexp set. So the following would be matched.
[abc]
[1-9\n\t]
[ \t\]]
[\\\]]
[[\\\\\\\]]
The square brackets cannot be nested, by example in [[]]
, we want to match [[]
.
Although, since a \]
indicate an escaped bracket, we need to skip those. But a pattern such as \\]
must be accepted. The following would not be matched.
[\]
[\\\]
[abc\\\]
The rule ends up being match from [
to the first ]
that is not preceded by an odd amount of \
.
It seems negative lookbehind does not work because it must have fixed length.
Edit: An interesting solution was given by Wiktor Stribiżew
re.compile(r'\[[^]\\]*(?:\\.[^]\\]*)*\]')
Edit: Simpler version of the above by Rawing
r'\[(?:\\.|[^]\\])*\]'
Upvotes: 2
Views: 75
Reputation: 627082
You may use
re.compile(r'\[[^]\\]*(?:\\.[^]\\]*)*]', re.DOTALL)
See the regex demo.
Details
\[
- a [
char[^]\\]*
- 0 or more chars other than ]
and \
(?:
- start of a non-capturing group matching a sequence of:
\\.
- a \
char followed with any char[^]\\]*
- 0 or more chars other than ]
and \
)*
- zero or more repetitions of the patterns inside the non-capturing group]
- a ]
char.The regex follows the unroll-the-loop principle. Depending on the input it may work much, even 10+ times faster than the non-unrolled version, r'\[(?:\\.|[^]\\])*]'
, that is based on an infinitely quantified alternation group, causing lots of redundant backtracking steps.
Note that the regex above can fail when the initial [
is preceded with a backslash. In those cases, you will need
r'(?<!\\)(?:\\{2})*(\[[^]\\]*(?:\\.[^]\\]*)*])'
See this regex demo
The main difference here is (?<!\\)(?:\\{2})*
, a (?<!\\)
negative lookbehind that fails the match if the current position is preceded with a \
char, and (?:\\{2})*
that matches 0+ repetitions of two literal backslashes. The rest of the pattern is enclosed with capturing parentheses, and when a match is found, you just need to access match.group(1)
to get the right value.
Upvotes: 1