JustinLovinger

Reputation: 922

Python regex erroneously matches trailing newlines

A regular expression with an end anchor ($) completely ignores the presence of a trailing newline when matching.

Ex.

import re

regex = re.compile(r'^$')

text = "\n"
print(regex.match(text))

The above code snippet will match the text containing "\n". Since the regular expression above has nothing between the start and end anchors, I assume it should only match the null string.

Is there any way to work around this behavior?

P.S. The above code is a simplified regular expression to illustrate the problem. The actual regular expression that I'm using is:

re.compile(r'^\S(?:\S| (?!\s)){0,199}$(?<=\S)')

This expression also matches text that ends with a trailing newline.

Upvotes: 2

Views: 322

Answers (1)

ErikR

Reputation: 52029

Use \Z to match the end of the buffer and \A to match the beginning of the buffer.
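As a minimal sketch of that suggestion, the snippet below compares the two patterns from the question with versions where ^ and $ are swapped for \A and \Z. The constant names and the matches helper are hypothetical, introduced only for this illustration, and the sample strings are made up.

```python
import re

# The question's patterns, anchored with ^ and $.
EMPTY_DOLLAR = re.compile(r'^$')
FIELD_DOLLAR = re.compile(r'^\S(?:\S| (?!\s)){0,199}$(?<=\S)')

# The same patterns with \A and \Z swapped in for ^ and $.
EMPTY_STRICT = re.compile(r'\A\Z')
FIELD_STRICT = re.compile(r'\A\S(?:\S| (?!\s)){0,199}\Z(?<=\S)')


def matches(pattern, text):
    """Hypothetical helper: True if pattern matches at the start of text."""
    return pattern.match(text) is not None


print(matches(EMPTY_DOLLAR, "\n"))             # True: $ matches before the trailing newline
print(matches(EMPTY_STRICT, "\n"))             # False: \Z matches only at the very end
print(matches(EMPTY_STRICT, ""))               # True
print(matches(FIELD_DOLLAR, "hello world\n"))  # True
print(matches(FIELD_STRICT, "hello world\n"))  # False
print(matches(FIELD_STRICT, "hello world"))    # True
```

With \Z in place, a trailing newline is treated like any other character, so the input has to be stripped beforehand if a trailing newline should still be tolerated.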

Update: ^$ doesn't do what you want because the rules for matching $ are:

  • if the buffer ends in a newline, $ matches just before the final newline (and also at the very end of the buffer)
  • otherwise $ matches only at the end of the buffer

If the regex is compiled with re.MULTILINE then $ will also match just before any internal newline.

Here is some code which demonstrates this:

import re

def showit(r, inp):
  # Print the start and end offsets of every match of r in inp.
  ms = r.finditer(inp)
  for i, m in enumerate(ms):
    print("  match", i, " start:", m.start(0), " end:", m.end(0))
  print("")

print("regex x$ against x\\nx")
showit(re.compile("x$"), "x\nx")

print("regex x$ against x\\nx\\n")
showit(re.compile("x$"), "x\nx\n")

print("regex x$ re.MULTILINE against x\\nx")
showit(re.compile("x$", re.MULTILINE), "x\nx")

Output:

regex x$ against x\nx
  match 0  start: 2  end: 3

regex x$ against x\nx\n
  match 0  start: 2  end: 3

regex x$ re.MULTILINE against x\nx
  match 0  start: 0  end: 1
  match 1  start: 2  end: 3
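
The rules for $ can also be checked more directly by listing every offset at which a bare $ matches. This is only a sketch: dollar_positions is a hypothetical helper name, not part of the re module, and the sample strings are chosen for illustration.

```python
import re


def dollar_positions(text, flags=0):
    """Hypothetical helper: offsets in text where a bare $ matches."""
    return [m.start() for m in re.finditer(r'$', text, flags)]


print(dollar_positions("ab"))                     # [2]
print(dollar_positions("ab\n"))                   # [2, 3]
print(dollar_positions("a\nb\n"))                 # [3, 4]
print(dollar_positions("a\nb\n", re.MULTILINE))   # [1, 3, 4]
```

For "ab\n" the offset 2 is the position just before the final newline, which is why ^$ in the question matches "\n" at offset 0.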

Upvotes: 7
