Reputation: 410
I came up with these two regex patterns
1.
\([0-9]\)\s+([^.!?]+[.!?])
2.
[.!?]\s+([A-Z].*?[.!?])
To match sentences in strings like these:
(1) A first sentence, that always follows a number in parantheses. This is my second sentence. This is my third sentence, (...) .
Thanks to your answers I managed to get the intro sentence after the number in parentheses. I also get the 2nd sentence with my 2nd regex.
However, the third sentence is not captured, since the . that ends the second sentence was already consumed by the previous match. My goal is to get the start point of these sentences by two methods:
(1)
How can I avoid the matching to fail for the 3rd and following sentences?
Thanks for any help!
Upvotes: 0
Views: 94
Reputation: 163207
You could use a capturing group with a negated character class such as [^.!?]
If you want to match 1 or more digits you could use [0-9]+
\([0-9]\)\s+([^.!?]+[.!?])
\([0-9]\) Match a digit between parentheses
\s+ Match 1+ whitespace chars
( Capture group 1
[^.!?]+[.!?] Match 1+ times any char other than ., ! or ?, then match one of them
) Close group
For example
import re
regex = r"\([0-9]\)\s+([^.!?]+[.!?])"
test_str = "(1) This is my first sentence, it has to be captured. This is my second sentence."
print(re.findall(regex, test_str))
Output
['This is my first sentence, it has to be captured.']
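As a small sketch of the [0-9]+ variant mentioned above (the test string with a two-digit number is made up for illustration and is not from the question):

```python
import re

# Same pattern as above, but [0-9]+ accepts numbers with more than one digit.
regex_multi = r"\([0-9]+\)\s+([^.!?]+[.!?])"
# Hypothetical input, chosen only to show a multi-digit number.
test_multi = "(12) This is my first sentence, it has to be captured. This is my second sentence."
result_multi = re.findall(regex_multi, test_multi)
print(result_multi)
```

With the single-digit pattern [0-9] this string would not match at all, because (12) contains two digits.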
If you want to match the other sentences as well and be able to differentiate between the first sentence and the others, you might use an alternation with another capturing group
(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))
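A rough sketch of how the alternation could be used to tell the first sentence from the others and to get their start positions. The three-sentence test string and the variable names are assumptions for illustration; note that the second alternative only recognizes sentences that start with a capital letter and end with a period.

```python
import re

# Alternation from above: group 1 = sentence after "(n)", group 2 = any later sentence.
pattern = r"(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))"
text = ("(1) This is my first sentence, it has to be captured. "
        "This is my second sentence. This is my third sentence.")

first, others = [], []
for m in re.finditer(pattern, text):
    if m.group(1) is not None:
        # Matched the "(n) sentence" branch.
        first.append((m.start(1), m.group(1)))
    else:
        # Matched the "later sentence" branch.
        others.append((m.start(2), m.group(2)))

print(first)
print(others)
```

Because each match ends after the sentence's trailing space rather than before the next sentence's leading delimiter, the third and following sentences are still found.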
Upvotes: 1
Reputation: 532
You have multiple options to do this.
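The first one is a lookbehind. You should replace ':' with '<=' in the group, so that (?:...) becomes (?<=...).
Unfortunately, lookbehind in Python's re module does not support variable-length patterns, so only a single whitespace character is allowed after the closing parenthesis.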
import re
ss = '(1) This is my first sentence, it has to be captured. This is my second sentence.'
# Fixed-width lookbehind: "(digit)" followed by exactly one whitespace character
re.search(r'(?<=\([0-9]\)\s).*?[.!?]', ss).group(0)
Output:
'This is my first sentence, it has to be captured.'
You can also search for a group:
re.search(r'\([0-9]\)\s+(.*?[.!?])', ss).group(1)
Output:
'This is my first sentence, it has to be captured.'
This version allows variable-length patterns, such as \s+.
Both options need only minimal modifications of your original pattern.
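The same lookbehind idea can be sketched for the following sentences, which is where the question's second regex fails: since a lookbehind does not consume the preceding ". ", the delimiter stays available for the next match. This is an illustrative variant, not part of the patterns above; the three-sentence string and the names are assumptions, and the fixed-width restriction (exactly one whitespace character) still applies.

```python
import re

ss3 = ('(1) This is my first sentence, it has to be captured. '
       'This is my second sentence. This is my third sentence.')

# Lookbehind checks for a sentence end plus one whitespace char without consuming it.
following = r'(?<=[.!?]\s)[A-Z].*?[.!?]'

matches = list(re.finditer(following, ss3))
sentences = [m.group(0) for m in matches]
starts = [m.start() for m in matches]  # start index of each following sentence

print(sentences)
print(starts)
```

The first sentence is not matched here because it is preceded by ") " rather than by sentence-ending punctuation, so it can be handled separately with one of the options above.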
Upvotes: 1
Reputation: 147146
You can use your existing regex, just placing a group around the sentence part (.*?[.!?]) and getting group 1 from the output of re.search:
import re
para = '(1) This is my first sentence, it has to be captured. This is my second sentence.'
print(re.search(r'\([0-9]\)\s+(.*?[.!?])', para).group(1))
Output:
This is my first sentence, it has to be captured.
Upvotes: 1