Python regex erroneously matches trailing newlines

Question

A regular expression with an end anchor ($) completely ignores the presence of a trailing newline when matching.

Ex.

import re

regex = re.compile(r'^$')

text = "\n"
print(regex.match(text))

The above code snippet matches the text "\n". Since the regular expression has nothing between the start and end anchors, I assumed it would match only the empty string.

Is there any way to work around this behavior?

P.S. The above code is a simplified regular expression to illustrate the problem. The actual regular expression that I'm using is:

re.compile(r'^\S(?:\S| (?!\s)){0,199}$(?<=\S)')

This also matches text containing a trailing newline.

ErikR · Accepted Answer

Use \Z to match the end of the buffer and \A to match the beginning of the buffer.
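As a minimal sketch of that suggestion (written for Python 3; the names dollar_pattern, strict_pattern, and accepts are made up for illustration), swapping ^ and $ for \A and \Z in the two patterns from the question should reject the trailing newline:

```python
import re

# The asker's pattern, and a variant anchored with \A and \Z instead of ^ and $.
dollar_pattern = re.compile(r'^\S(?:\S| (?!\s)){0,199}$(?<=\S)')
strict_pattern = re.compile(r'\A\S(?:\S| (?!\s)){0,199}\Z(?<=\S)')

def accepts(pattern, text):
    """Return True if pattern matches text starting at its first character."""
    return pattern.match(text) is not None

# The simplified case from the question.
print(accepts(re.compile(r'^$'), "\n"))     # True: $ matches before the final newline
print(accepts(re.compile(r'\A\Z'), "\n"))   # False: \Z only matches at the very end
print(accepts(re.compile(r'\A\Z'), ""))     # True

# The real pattern from the question.
print(accepts(dollar_pattern, "foo bar\n"))  # True
print(accepts(strict_pattern, "foo bar\n"))  # False
print(accepts(strict_pattern, "foo bar"))    # True
```

Only the anchors change here; the rest of the pattern, including the trailing lookbehind, is left as the asker wrote it.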

Update: ^$ doesn't do what you want because of the rules for matching $:

If the buffer ends in a newline, $ matches just before that final newline (as well as at the very end of the buffer).
Otherwise, $ matches only at the end of the buffer.

If the regex is compiled with re.MULTILINE, then $ will also match just before any internal newline.

Here is some code which demonstrates this:

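import re

def showit(r, inp):
  # Print the start and end offsets of every match of regex r in inp.
  for i, m in enumerate(r.finditer(inp)):
    print("  match", i, " start:", m.start(0), " end:", m.end(0))
  print("")

# No trailing newline: $ matches only at the end of the buffer.
print("regex x$ against x\nx")
showit(re.compile("x$"), "x\nx")

# Trailing newline: $ matches just before the final newline.
print("regex x$ against x\nx\n")
showit(re.compile("x$"), "x\nx\n")

# re.MULTILINE: $ also matches before the internal newline.
print("regex x$ re.MULTILINE against x\nx")
showit(re.compile("x$", re.MULTILINE), "x\nx")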

Output:

regex x$ against x
x
  match 0  start: 2  end: 3

regex x$ against x
x

  match 0  start: 2  end: 3

regex x$ re.MULTILINE against x
x
  match 0  start: 0  end: 1
  match 1  start: 2  end: 3
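
The same rules can be seen by listing the offsets at which a bare $ matches. This is a small sketch for Python 3; end_positions is a hypothetical helper written for this illustration, not part of the re module:

```python
import re

def end_positions(pattern, text, flags=0):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

text = "x\nx\n"  # four characters, ending in a newline

# Default mode: just before the final newline, and at the very end.
print(end_positions(r'$', text))                 # [3, 4]

# re.MULTILINE: additionally just before the internal newline.
print(end_positions(r'$', text, re.MULTILINE))   # [1, 3, 4]

# \Z: only at the very end of the buffer.
print(end_positions(r'\Z', text))                # [4]
```

Offset 3 is the position just before the final newline, which is why x$ still matches a buffer that ends in "\n", while \Z only ever reports offset 4.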
