zsfzu0

Reputation: 156

Python regex does not match string as intended for some reason

I have the content of an English dictionary at hand and I want to find the definition for a specific example sentence.

For example, I want to find the definition for "example sentence 2b". I thought the code would look like this:

re.search(r'\d\. ([^\n]*?)\n(?!.*\d\. ).*?example sentence 2b', content, flags=re.DOTALL)

Here, the "content" is as follows:

1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b

Live test here - https://regex101.com/r/UOz6DA/1/

As you can see in the live test, I didn't get the desired match, "definition2", and I don't know why.

PS: I used (?!.*\d\. ).* based on this post - regex how to exclude specific characters or string anywhere

Upvotes: 0

Views: 128

Answers (3)

zsfzu0

Reputation: 156

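The reason it won't match is the "3. " line, even though it comes after "example sentence 2b". With re.DOTALL, the .* inside the negative lookahead (?!.*\d\. ) can cross line breaks, so it scans to the end of the content, finds "3. ", and rejects the position right after "2. definition2".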

For a simpler example, if you use the "s" flag in this live demo, the second line won't match any more because of the "chocolate" substring in the third line.
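
To make this concrete, here is a small sketch that runs the pattern from the question twice: once on the full content, and once on a copy cut off just before "3. ". The truncation is only an illustration added here (it is not a fix), and the variable names are made up for the example.

```python
import re

content = (
    "1. definition1\n"
    "example sentence 1a\n"
    "example sentence 1b\n"
    "2. definition2\n"
    "example sentence 2a\n"
    "example sentence 2b\n"
    "3. definition3\n"
    "example sentence 3a\n"
    "example sentence 3b\n"
)

# Pattern copied from the question.
original = r'\d\. ([^\n]*?)\n(?!.*\d\. ).*?example sentence 2b'

# With DOTALL, ".*" inside the lookahead crosses newlines and reaches "3. ",
# so the lookahead fails right after "2. definition2\n".
with_later_entry = re.search(original, content, flags=re.DOTALL)
print(with_later_entry)

# Illustration only: drop everything from "3. " onwards and search again.
truncated = content[:content.index("3. ")]
without_later_entry = re.search(original, truncated, flags=re.DOTALL)
print(without_later_entry.group(1))
```

The first search finds nothing, while the second one captures the definition, which suggests the later numbered line is what blocks the match.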

Upvotes: 0

41686d6564

Reputation: 19641

You may use the following pattern with the re.MULTILINE flag (so that ^ matches at the start of every line) and without the re.DOTALL flag:

^\d+\. (.*)(?:\n(?!\d+\. ).*)*\nexample sentence 2b

Regex demo.

Breakdown:

  • ^ - Beginning of line.
  • \d+\. - Match one or more digits, then a dot, and a space character.
  • (.*) - Match zero or more characters and capture them in group 1.
  • (?: - Beginning of a non-capturing group.
    • \n(?!\d+\. ) - Match a line-break that is not followed by a "definition line".
    • .* - Match zero or more characters.
  • ) - Close the non-capturing group.
  • * - Repeat the previous group zero or more times.
  • \nexample sentence 2b - Match a linebreak character followed by the target sentence.
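
The breakdown above can be tried out in Python roughly like this. It is a minimal sketch: find_definition is a hypothetical helper name introduced here, and re.escape is added on the assumption that the target sentence may contain regex metacharacters. Note that re.MULTILINE is needed so that ^ can match at the start of the "2. " line.

```python
import re

content = (
    "1. definition1\n"
    "example sentence 1a\n"
    "example sentence 1b\n"
    "2. definition2\n"
    "example sentence 2a\n"
    "example sentence 2b\n"
    "3. definition3\n"
    "example sentence 3a\n"
    "example sentence 3b\n"
)


def find_definition(content, sentence):
    """Hypothetical helper: return the definition whose block contains a
    line starting with `sentence`, or None if there is no such line."""
    # Same pattern as above, with the target sentence escaped and appended.
    pattern = r'^\d+\. (.*)(?:\n(?!\d+\. ).*)*\n' + re.escape(sentence)
    # MULTILINE lets ^ match at every line start; DOTALL is deliberately
    # left off so that "." stops at the end of each line.
    match = re.search(pattern, content, flags=re.MULTILINE)
    return match.group(1) if match else None


print(find_definition(content, "example sentence 2b"))
```

For a sentence that does not appear in the content, the helper returns None instead of raising.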

Upvotes: 2

Mark Rofail

Reputation: 828

You are missing the \n character to match the line break.

Upvotes: 0
