Reputation: 111
I am trying to capture the word following a cardinal number which is followed by a dot in a given text. For example, for the expression in quotation marks:
"1. text"
"text" should be returned. The "text" can be just plain letters or another number.
I have come up with the following regular expression which accomplishes exactly that:
r'(?:(?:(?<=\s)|(?<!.))\d+\.\s)([^\s.,:!?]*)'
The problem is that if "text" has the same form as the non-capturing prefix (a number followed by a dot), it is not tried again as the start of a new match. Example:
"2. wordX wordY.": "wordX" is returned, expected behavior
"3. 4. wordZ.": "4" is returned, expected behavior.
I also expect to get "wordZ" as it matches in the expression "4. wordZ.", but it is not captured.
How do I get both where the matched expressions overlap?
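For reference, here is a minimal reproduction of the behaviour described above, using the pattern exactly as posted. The helper name first_pass is only illustrative. As far as I can tell, the first match consumes "3. 4", so scanning resumes at the dot after "4" and "4. " can no longer start a match of its own.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(?:(?:(?<=\s)|(?<!.))\d+\.\s)([^\s.,:!?]*)')

def first_pass(text):
    """Return the group 1 values that re.findall yields for the pattern."""
    # findall scans left to right and never reuses consumed characters,
    # so a captured "4" cannot also serve as the prefix of the next match.
    return pattern.findall(text)

print(first_pass("2. wordX wordY."))  # ['wordX']
print(first_pass("3. 4. wordZ."))     # ['4']
```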
Upvotes: 0
Views: 76
Reputation: 163457
You could match the first digits, dot and whitespace pattern, and then start a capture group.
In that capture group, you can optionally repeat the same pattern, followed by the character class.
Then, for each match, split the group 1 value on a dot and a space.
(?<!\S)         Assert a whitespace boundary to the left
\d+\.\s         Match digits, a dot and a whitespace char
(               Capture group 1
(?:\d+\.\s)*    Match optional repetitions of digits, a dot and a whitespace char
[^\s.,:!?]+     Match 1+ chars other than a whitespace char or . , : ! ?
)               Close group 1
import re
pattern = r"(?<!\S)\d+\.\s((?:\d+\.\s)*[^\s.,:!?]+)"
strings = [
"1. text",
"2. wordX wordY.",
"3. 4. wordZ."
]
for s in strings:
    for m in re.finditer(pattern, s):
        print(m.group(1).split(". "))
Output
['text']
['wordX']
['4', 'wordZ']
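The same idea can be wrapped in a small helper that flattens the results of all matches into one list. This is only a sketch: the name words_after_numbers is made up for illustration, and it splits with re.split on a dot plus any whitespace char (an assumption on my part, to mirror the \.\s in the pattern) rather than on a literal ". ", so a tab after the dot would also be handled.

```python
import re

# Same pattern as above, compiled once.
PATTERN = re.compile(r"(?<!\S)\d+\.\s((?:\d+\.\s)*[^\s.,:!?]+)")

def words_after_numbers(text):
    """Collect every token that follows a 'digits dot whitespace' prefix."""
    found = []
    for m in PATTERN.finditer(text):
        # Group 1 looks like "4. wordZ"; split it on dot + whitespace.
        # The final token cannot contain a dot or whitespace, so only
        # the numbered prefixes are split off.
        found.extend(re.split(r"\.\s", m.group(1)))
    return found

print(words_after_numbers("3. 4. wordZ."))    # ['4', 'wordZ']
print(words_after_numbers("1. a b 2. 3. c"))  # ['a', '3', 'c']
```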
Another way could be using the PyPI regex module, which supports an unbounded quantifier inside a lookbehind, to look for digits, a dot and a whitespace char on the left.
This is the same pattern structure as above, only the digits, dot and whitespace prefix is now asserted in the lookbehind, and each value that was part of the group is now a match of its own.
(?<=(?<!\S)\d+\.\s(?:\s\d+\.\s)*)[^\s.,:!?]+
import regex
pattern = r"(?<=(?<!\S)\d+\.\s(?:\s\d+\.\s)*)[^\s.,:!?]+"
strings = [
"1. text",
"2. wordX wordY.",
"3. 4. wordZ."
]
for s in strings:
    print(regex.findall(pattern, s))
Output
['text']
['wordX']
['4', 'wordZ']
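If installing a third-party module is not an option, a different technique that should give the same results with the standard re module is to capture inside a lookahead. This is not one of the two patterns above but a swapped-in variant: only the digits, dot and whitespace prefix is consumed, and the following token is captured without being consumed, so it can still act as the prefix of the next match. The names LOOKAHEAD and words_after_numbers_lookahead are hypothetical.

```python
import re

# Consume only "digits dot whitespace"; capture the next token in a lookahead
# so that it stays available as the start of a following match.
LOOKAHEAD = re.compile(r"(?<!\S)\d+\.\s(?=([^\s.,:!?]+))")

def words_after_numbers_lookahead(text):
    """Return the tokens captured by the lookahead group, in order."""
    return LOOKAHEAD.findall(text)

for s in ["1. text", "2. wordX wordY.", "3. 4. wordZ."]:
    print(words_after_numbers_lookahead(s))
# ['text']
# ['wordX']
# ['4', 'wordZ']
```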
Upvotes: 1