10SecTom

Reputation: 2664

Python regex match across multiple lines

I am trying to match a regex pattern across multiple lines. The pattern begins and ends with a substring, both of which must be at the beginning of a line. I can match across lines, but I can't seem to specify that the end pattern must also be at the beginning of a line.

Example string:

Example=N      ; Comment Line One error=

; Comment Line Two.

Desired=

I am trying to match from Example= up to Desired=. This works if error= is not in the string. When it is present, however, the match stops early and I get Example=N ; Comment Line One error=

config_value = 'Example'
pattern = '^{}=(.*?)([A-Za-z]=)'.format(config_value)
match = re.search(pattern, string, re.M | re.DOTALL)

I also tried:

config_value = 'Example'
pattern = '^{}=(.*?)(^[A-Za-z]=)'.format(config_value)
match = re.search(pattern, string, re.M | re.DOTALL)

Upvotes: 5

Views: 17384

Answers (1)

Wiktor Stribiżew

Reputation: 627469

Your pattern contains .*? and your options include re.DOTALL (re.S is its equivalent), which makes . match newlines too. If you are wondering why a regex does not match across multiple lines, check these first:

  • Make sure you read the whole text in which an expected match can span lines into a single variable (remember that a regex can only search inside a string)
  • Make sure your pattern can actually match any character. The most common problem is using . without the re.S/re.DOTALL flag (or its inline equivalent, (?s)). For example, re.search(r'a(.*?)b', '__ a\nb __') finds no match, but re.search(r'a(.*?)b', '__ a\nb __', re.DOTALL) and re.search(r'(?s)a(.*?)b', '__ a\nb __') both do
  • Make sure you test the regex properly. If you use an online regex tester, replace any escape sequences with the literal characters. A common mistake is to copy/paste the string literal into the test text field, which leads to confusion. So, to test against '__ a\nb __', paste it there and replace the \n with a real line break
  • Do not confuse re.S/re.DOTALL with re.M/re.MULTILINE. The latter only redefines the behavior of the ^ and $ anchors: with re.M/re.MULTILINE, ^ matches at the start of any line and $ matches at the end of any line. It does not make your regex find matches that span multiple lines.
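
To make the difference between the two flags concrete, here is a minimal sketch that reuses the '__ a\nb __' string from the checklist. The variable names and the 'one\ntwo\nthree' string are only illustrative.

```python
import re

text = "__ a\nb __"

# Without DOTALL, '.' cannot cross the line break, so nothing matches.
no_flag = re.search(r"a(.*?)b", text)

# With DOTALL (or the inline (?s)), '.' also matches the line break.
dotall = re.search(r"a(.*?)b", text, re.DOTALL)
inline = re.search(r"(?s)a(.*?)b", text)

# MULTILINE alone does not help here: it only changes ^ and $.
multiline_only = re.search(r"a(.*?)b", text, re.MULTILINE)

# What MULTILINE does change: ^ now matches at the start of every line.
first_line_only = re.findall(r"^\w+", "one\ntwo\nthree")
every_line = re.findall(r"^\w+", "one\ntwo\nthree", re.MULTILINE)

print(no_flag, multiline_only)      # both are None
print(repr(dotall.group(1)))        # the captured line break
print(first_line_only, every_line)
```

In short, re.DOTALL decides what . can consume, while re.MULTILINE decides where ^ and $ can match. A pattern that must span lines and also anchor to line starts needs both.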

Now, for this concrete case in the OP, you may use

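import re

config_value = 'Example'
# s is the whole text to search, read into a single string.
# re.escape keeps any special characters in the key literal.
pattern = r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(re.escape(config_value))
match = re.search(pattern, s)
if match:
    print(match.group(1))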

Pattern details

  • (?sm) - inline flags: re.DOTALL (s) and re.MULTILINE (m) are on
  • ^ - start of a line
  • Example= - a substring
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • (?=[\r\n]+\w+=|\Z) - a positive lookahead that requires either 1+ CR or LF characters followed by 1 or more word chars and then a = sign, or the end of the string (\Z). Because the key must come right after a line break, a mid-line error= does not end the match.
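
As a quick check of the breakdown above, here is a sketch that applies the pattern to the example string from the question. It assumes the sample is three consecutive lines separated by single line breaks. The helper name extract_value and the use of re.escape are additions made for illustration only.

```python
import re

# The question's sample text, assumed to be three consecutive lines.
sample = (
    "Example=N      ; Comment Line One error=\n"
    "; Comment Line Two.\n"
    "Desired=\n"
)


def extract_value(text, key):
    """Return the text after 'key=' at a line start, up to the next
    'word=' that begins a line (or the end of the string)."""
    # re.escape is an added safeguard so the key is matched literally.
    pattern = r"(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)".format(re.escape(key))
    match = re.search(pattern, text)
    return match.group(1) if match else None


value = extract_value(sample, "Example")
print(repr(value))

# 'error=' appears only in the middle of a line, so ^error= never matches.
print(extract_value(sample, "error"))
```

The captured value runs through the mid-line error= and the second comment line, and stops right before the line break that precedes Desired=.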


Upvotes: 10
