andreSmol

Reputation: 1038

Python, How should this regex work

I have a regex that should find all the "heading lines", meaning lines that contain some text and do not end with a period, ? or !:

import re
tit_pat = re.compile(r"([\w ]+?)(?![!?.])\n", re.UNICODE)
res = tit_pat.findall(data)

Example:

Chapter 1x test
This a test a test test test test. This a test with some text and more text.This a test with some text and more text some text and more text. This is some more text some more text some more tex some more text chapter aaa
This a test. This a test with some text and more text some text and more text some text and more text some text and more text.
bbbb
The end.

The regex finds all the "heading lines" that contain some text with no period before the newline. That is expected, because there is a negative lookahead that checks that there are no periods (or ! or ?) before accepting. However, I may have a sentence that starts on one line and ends with a period on the next line. The regex is not finding the line whose text has no period at the end. Is there an explanation for this behavior?

Upvotes: 0

Views: 74

Answers (1)

Karl Knechtel

Reputation: 61617

Your regex basically means "find as few word characters and spaces as possible, such that there is no unwanted character after them, and then find a newline immediately after them". The character class will never match the unwanted characters because they are not word characters, and the lookahead assertion is redundant because the next character has to be a newline, which is not an unwanted character.
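To see the redundancy concretely, here is a small sketch that runs the pattern with and without the lookahead. The sample string and the variable names are made up for illustration; they are not from the question.

```python
import re

# Hypothetical sample: two lines without end punctuation, two lines ending in a period.
sample = "Chapter 1x test\nThis is a sentence.\nbbbb\nThe end.\n"

# The pattern from the question, and the same pattern with the lookahead removed.
with_lookahead = re.compile(r"([\w ]+?)(?![!?.])\n", re.UNICODE)
without_lookahead = re.compile(r"([\w ]+?)\n", re.UNICODE)

a = with_lookahead.findall(sample)
b = without_lookahead.findall(sample)

# Both patterns return the same matches: lines ending in a period are
# rejected because "." is not in [\w ], not because of the lookahead.
print(a)
print(b)
print(a == b)
```

On this sample both calls give the same list, so removing the lookahead changes nothing.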

What you seem to want is "find a line such that the last character is not one of the unwanted characters". This probably doesn't really call for regexes, but if you want to use them, the most obvious way imo is to take the text a line at a time and then search for something like (?<![.!?])$.
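A minimal sketch of that line-at-a-time approach, assuming the text is already in a string. The sample data and the helper name heading_lines are hypothetical, chosen only for this illustration.

```python
import re

# Hypothetical sample text, loosely modeled on the example in the question.
data = (
    "Chapter 1x test\n"
    "This a test. This a test with some text chapter aaa\n"
    "This a test. More text.\n"
    "bbbb\n"
    "The end.\n"
)

# Matches at the end of the string only if the previous character is not . ! or ?
no_terminator = re.compile(r"(?<![.!?])$")

def heading_lines(text):
    """Return the non-empty lines whose last character is not '.', '!' or '?'."""
    result = []
    for line in text.splitlines():
        stripped = line.rstrip()  # ignore trailing whitespace before checking
        if stripped and no_terminator.search(stripped):
            result.append(stripped)
    return result

print(heading_lines(data))
```

Without a regex, the same check can be written as not stripped.endswith((".", "!", "?")), since str.endswith accepts a tuple of suffixes.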

Upvotes: 1
