biglin

Reputation: 111

How to re-check matched expression in regex?

I am trying to capture the word that follows a cardinal number and a dot in a given text. For example, for the expression in quotation marks:

"1. text"

"text" should be returned. The "text" can be just plain letters or another number.

I have come up with the following regular expression which accomplishes exactly that:

r'(?:(?:(?<=\s)|(?<!.))\d+\.\s)([^\s.,:!?]*)'

The problem is that if "text" itself has the same form as the non-capturing prefix (a number followed by a dot), it is consumed by the match and is not checked again as a possible prefix. Example:

"2. wordX wordY.": "wordX" is returned, expected behavior

"3. 4. wordZ.": "4" is returned, expected behavior.

I also expect to get "wordZ" as it matches in the expression "4. wordZ.", but it is not captured.

How do I get both where the matched expressions overlap?
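The behavior described above can be reproduced with a short script. This is a minimal sketch using the standard re module; the names samples and results are only illustrative:

```python
import re

# Pattern from the question: a "digits. " prefix, then capture the next token
pattern = r'(?:(?:(?<=\s)|(?<!.))\d+\.\s)([^\s.,:!?]*)'

samples = ["1. text", "2. wordX wordY.", "3. 4. wordZ."]
results = {s: re.findall(pattern, s) for s in samples}

for s, found in results.items():
    # For "3. 4. wordZ." the "4" is consumed as the captured token, so the
    # scan resumes after it and "4. wordZ" is never tried as a new match.
    print(repr(s), "->", found)
```

For the third sample this prints only ['4'], which is the missing "wordZ" case.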

Upvotes: 0

Views: 76

Answers (1)

The fourth bird

Reputation: 163457

You could match the first "digits, dot, whitespace" prefix, and then start a capture group.

In that capture group, optionally repeat the same prefix pattern, followed by the character class from your pattern.

Then, for each match, split the group value on a dot and a space.

  • (?<!\S) Assert a whitespace boundary to the left (start of the string or a preceding whitespace char)
  • \d+\.\s Match 1+ digits, a dot and a whitespace char
  • ( Capture group 1
    • (?:\d+\.\s)* Optionally repeat 1+ digits, a dot and a whitespace char
    • [^\s.,:!?]+ Match 1+ chars other than the ones listed in the negated character class
  • ) Close group 1

import re

pattern = r"(?<!\S)\d+\.\s((?:\d+\.\s)*[^\s.,:!?]+)"
strings = [
    "1. text",
    "2. wordX wordY.",
    "3. 4. wordZ."
]

for s in strings:
    for m in re.finditer(pattern, s):
        print(m.group(1).split(". "))

Output

['text']
['wordX']
['4', 'wordZ']
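If a single flat list per input is more convenient, the same pattern can be wrapped in a small function. This is only a sketch: words_after_numbers is a hypothetical helper name, and splitting with re.split on a dot plus any whitespace char (instead of the literal ". ") is an assumption made so that the split mirrors the \.\s in the pattern:

```python
import re

# Same pattern as above, compiled once
PATTERN = re.compile(r"(?<!\S)\d+\.\s((?:\d+\.\s)*[^\s.,:!?]+)")

def words_after_numbers(text):
    """Hypothetical helper: flat list of every item following a 'digits. ' prefix."""
    items = []
    for m in PATTERN.finditer(text):
        # Split on a dot followed by any whitespace char, mirroring \.\s
        items.extend(re.split(r"\.\s", m.group(1)))
    return items

print(words_after_numbers("3. 4. wordZ."))
print(words_after_numbers("3.\t4.\twordZ and 7. end"))
```

The second call uses tabs as separators, a case where splitting on the literal ". " would leave "4.\twordZ" in one piece.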

Another way could be to use the regex module from PyPI, which supports an unbounded quantifier in a lookbehind, to look for the digits, dot and whitespace on the left.

This has the same pattern structure as above, only the prefix is now asserted in the lookbehind, and what was the group value is now the match itself.

(?<=(?<!\S)\d+\.\s(?:\s\d+\.\s)*)[^\s.,:!?]+

import regex

pattern = r"(?<=(?<!\S)\d+\.\s(?:\s\d+\.\s)*)[^\s.,:!?]+"

strings = [
    "1. text",
    "2. wordX wordY.",
    "3. 4. wordZ."
]

for s in strings:
    print(regex.findall(pattern, s))

Output

['text']
['wordX']
['4', 'wordZ']
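For comparison, here is a sketch of a different technique that stays within the standard re module. It is not one of the two approaches above: it consumes only the digits, dot and whitespace prefix and captures the following token inside a lookahead, so the token is not consumed and can itself start the next match. The pattern and the name results are assumptions for illustration:

```python
import re

# Consume only the "digits. " prefix; the next token is captured inside a
# lookahead, so it stays available as the start of a following match.
pattern = r"(?<!\S)\d+\.\s(?=([^\s.,:!?]+))"

strings = [
    "1. text",
    "2. wordX wordY.",
    "3. 4. wordZ."
]

# findall returns the contents of the single capture group for each match
results = [re.findall(pattern, s) for s in strings]
for r in results:
    print(r)
```

With the three sample strings this prints the same lists as the outputs above, without a split step or a third-party module.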

Upvotes: 1
