DeepSpace

Reputation: 81594

Slight regex confusion - $ behavior when using multiline flag

Consider the following text:

!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
not interesting arbitrary text d
!

As you may have guessed, I want to extract a and c from every section. The interesting2 c line is optional, but I only need a if there is also c (section-wise).

Using !\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>.*?))?$\n(?=!)) I get:

a and c from the top 2 sections, but (understandably) a and c\nnot interesting arbitrary text d from the last section. See on regex101.

I doubt this is the most efficient regex for this situation as this small text requires 438 steps, so I'm open to any other more efficient solutions that will get the correct results.

If I change the regex to !\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n(?=!)) (\w+ instead of .*? in capture group c) the only thing it matches in the third section is a (as expected since \w does not include \n).
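Both observations can be reproduced with a short Python sketch. The question does not state the flags used on regex101, so re.MULTILINE | re.DOTALL is an assumption here (without DOTALL the .*? parts could not cross line breaks and neither pattern would match at all). The variable names are illustrative only.

```python
import re

# The sample text from the question, built from its repeated section.
section = "interesting1 a\nnot interesting b\ninteresting2 c\n"
text = ("!\n" + section + "!\n" + section + "!\n" + section
        + "not interesting arbitrary text d\n!\n")

# Assumed flags: the question does not say which ones were set.
FLAGS = re.MULTILINE | re.DOTALL

# First attempt: group c is a lazy .*?
lazy_c = re.compile(
    r"!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>.*?))?$\n(?=!))", FLAGS)
# Second attempt: group c is \w+
word_c = re.compile(
    r"!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n(?=!))", FLAGS)

lazy_results = [(m.group("a"), m.group("c")) for m in lazy_c.finditer(text)]
word_results = [(m.group("a"), m.group("c")) for m in word_c.finditer(text)]

print(lazy_results)  # third pair: c has swallowed the extra line
print(word_results)  # third pair: c is None
```

In the third section the lazy version captures 'c\nnot interesting arbitrary text d', while the \w+ version leaves c as None.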

What I don't understand is how to use $ in order to specify an optional line of arbitrary text between interesting2 c and the closing !.

Different variations of optional non-capturing groups and $ didn't give me the correct results. I even tried optional non-capturing groups in the lookahead part (to signify that there may be additional, optional text before the !).

Upvotes: 0

Views: 62

Answers (2)

Aran-Fey

Reputation: 43146

What I don't understand is how to use $ in order to specify an optional line of arbitrary text between interesting2 c and the closing !.

That's because $ has nothing to do with matching an optional line of text. $ is just an anchor that asserts a position at the end of the string (or before a newline, if the regex is in multi-line mode). It is not at all required for matching a line of text.
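A small sketch of the zero-width behaviour, using Python's re module. The two-line sample string is made up for illustration and is not from the question.

```python
import re

sample = "a\nb"  # hypothetical two-line string

# Without re.MULTILINE, $ matches only at the end of the string
# (or just before a final trailing newline), so nothing is found.
no_flag = re.search(r"a$", sample)

# With re.MULTILINE, $ also matches right before each newline.
with_flag = re.search(r"a$", sample, re.MULTILINE)

# $ consumes nothing: the newline is still pending after it,
# so a pattern that expects "b" straight away cannot match...
anchor_only = re.search(r"a$b", sample, re.MULTILINE)

# ...whereas consuming the newline explicitly with \n works.
anchor_then_newline = re.search(r"a$\nb", sample, re.MULTILINE)

print(no_flag, with_flag.span(), anchor_only, anchor_then_newline.span())
```

The third search fails precisely because something other than $ has to match the newline.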

The reason why your regex doesn't work is very simple: It's missing something that would match the optional line. As I've said before, $ is just an anchor - it doesn't consume any text. So in order to successfully match your (?=!) lookahead, the group c has to grow and match all the text up to the ! character. To prevent this from happening, you have to add something that can match the last line, like a .*? or [^\n]*.

In this specific case, though, it's not as simple as adding .*? before the (?=!) lookahead. Why? Because the c group is optional, and adding a .*? at the end would prevent the c group from matching:

!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n.*?(?=!))
                            ^  ^                              ^
                            |  |                              this .*? would grow
                            |  |                              and consume the
                            |  |                              "interesting2 c"
                            |  this group is optional, so it would be skipped
                            this .*? would match the empty string
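The effect in the diagram above can be checked with a short Python sketch. The flags re.MULTILINE | re.DOTALL are an assumption, since the question does not state them, and the variable names are illustrative.

```python
import re

# The sample text from the question, built from its repeated section.
section = "interesting1 a\nnot interesting b\ninteresting2 c\n"
text = ("!\n" + section + "!\n" + section + "!\n" + section
        + "not interesting arbitrary text d\n!\n")

# Assumed flags: the question does not say which ones were set.
FLAGS = re.MULTILINE | re.DOTALL

# The question's \w+ variant with an extra .*? in front of the lookahead.
with_trailing_lazy = re.compile(
    r"!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n.*?(?=!))", FLAGS)

results = [(m.group("a"), m.group("c")) for m in with_trailing_lazy.finditer(text)]
print(results)  # c is None in every section
```

Every section still matches, but the optional group is skipped each time and the trailing .*? consumes the "interesting2 c" line instead, so c is never captured.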

So it's probably best to rewrite the regex from scratch.

Here's how I would write it:

!\ninteresting1 (?P<a>.*)(?:\n[^!].*)*\ninteresting2 (?P<c>.*)

The logic is pretty straightforward:

  1. !\ninteresting1 (?P<a>.*) matches the first line and captures a
  2. (?:\n[^!].*)* skips any line that doesn't start with a !
  3. \ninteresting2 (?P<c>.*) matches and captures c

This is slightly different from your regex, in that it will only produce a match if both a and c exist within a section. See also the online demo.
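A minimal Python sketch of this pattern applied to the sample text from the question. No flags are needed, because every line break is matched explicitly with \n. The variable names and the extra "incomplete" section are made up for illustration.

```python
import re

# The sample text from the question, built from its repeated section.
section = "interesting1 a\nnot interesting b\ninteresting2 c\n"
text = ("!\n" + section + "!\n" + section + "!\n" + section
        + "not interesting arbitrary text d\n!\n")

pattern = re.compile(
    r"!\ninteresting1 (?P<a>.*)"   # first line of a section, captures a
    r"(?:\n[^!].*)*"               # skip lines that do not start with !
    r"\ninteresting2 (?P<c>.*)"    # the interesting2 line, captures c
)

pairs = [(m.group("a"), m.group("c")) for m in pattern.finditer(text)]
print(pairs)

# Hypothetical section with no interesting2 line: it produces no match.
incomplete = "!\ninteresting1 x\nnot interesting y\n!\n"
no_match = pattern.search(incomplete)
print(no_match)
```

The trailing "not interesting arbitrary text d" line is simply never consumed, so there is no need for $ or a lookahead.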

Upvotes: 1

yoonghm

Reputation: 4625

I use this:

import re

text=\
"""
!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
not interesting d
!
"""

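# Capture the single letter that follows "interesting1 " or "interesting2 "
# at the start of a line (re.MULTILINE makes ^ match after each newline).
pa = re.compile(r'^interesting[12] ([a-zA-Z])', re.MULTILINE)
m = pa.findall(text)  # flat list of letters, not grouped by section
print(m)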

It has 6 matches and takes 128 steps.


Upvotes: 1
