errorinpersona

Reputation: 410

Python Regex: non capturing group is captured

I came up with these two regex patterns

1.

\([0-9]\)\s+([^.!?]+[.!?])

2.

[.!?]\s+([A-Z].*?[.!?])

To match sentences in strings like these:

(1) A first sentence, that always follows a number in parentheses. This is my second sentence. This is my third sentence, (...) .

Thanks to your answers I managed to get the intro sentence after the number in parentheses. I also get the second sentence with my second regex.

However, the third sentence is not captured, because the . that ends the second sentence has already been consumed by the previous match. My goal is to find the start of each sentence using two methods:

  1. Getting the "intro" sentence by capturing the start after (1)
  2. Getting any other sentence by recognizing the dot, a whitespace character and a capital letter after it.
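
A minimal reproduction of the problem, assuming Python's re.findall and a simplified version of the sample text (the variable names are only illustrative):

```python
import re

# Simplified version of the sample text (assumed for illustration).
text = ("(1) A first sentence, that always follows a number in parentheses. "
        "This is my second sentence. This is my third sentence.")

# Second pattern from above: terminator, whitespace, then a capitalized sentence.
pattern = r"[.!?]\s+([A-Z].*?[.!?])"

# The match for the second sentence also consumes its closing "."; the next
# search starts after that dot, so no terminator is left in front of the
# third sentence.
found = re.findall(pattern, text)
print(found)
```

Only the second sentence is returned; the third one is skipped.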

How can I keep the match from failing for the third and following sentences?

Thanks for any help!

Upvotes: 0

Views: 94

Answers (3)

The fourth bird

Reputation: 163207

You could use a capturing group with a negated character class [^.!?]. If you want to match 1 or more digits, you could use [0-9]+ instead of [0-9].

\([0-9]\)\s+([^.!?]+[.!?])
  • \([0-9]\) Match a digit between parentheses
  • \s+ Match 1+ whitespace chars
  • ( Capture group 1
    • [^.!?]+[.!?] Match 1+ times any char other than ., ! or ?, then match one of them
  • ) Close group


For example

import re

regex = r"\([0-9]\)\s+([^.!?]+[.!?])"
test_str = "(1) This is my first sentence, it has to be captured. This is my second sentence."

print(re.findall(regex, test_str))

Output

['This is my first sentence, it has to be captured.']
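
If the number can have more than one digit, the [0-9]+ variant mentioned above might look like this. It is a small sketch; the "(12)" prefix in the test string is an assumed example, not taken from the question:

```python
import re

# [0-9]+ instead of [0-9] accepts numbers with more than one digit.
regex = r"\([0-9]+\)\s+([^.!?]+[.!?])"
test_str = "(12) This is my first sentence, it has to be captured. This is my second sentence."

first_sentences = re.findall(regex, test_str)
print(first_sentences)
```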

If you want to match the other sentences as well and be able to differentiate between the first sentence and the others, you might use an alternation with a second capturing group:

(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))
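
A sketch of how this alternation could be used with re.finditer to tell the intro sentence from the others and to get each start position. The sample text and the "intro"/"other" labels are assumptions for illustration:

```python
import re

pattern = re.compile(r"(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))")
text = ("(1) A first sentence, that always follows a number in parentheses. "
        "This is my second sentence. This is my third sentence.")

sentences = []
for m in pattern.finditer(text):
    if m.group(1) is not None:
        # Group 1 only participates for the sentence after "(n)".
        sentences.append(("intro", m.start(1), m.group(1)))
    else:
        # Group 2 holds every other sentence.
        sentences.append(("other", m.start(2), m.group(2)))

for kind, start, sentence in sentences:
    print(kind, start, sentence)
```

Note that the second alternative only accepts a dot as terminator, followed by a single space or the end of the string.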


Upvotes: 1

Sergey

Reputation: 532

You have multiple options. The first one is a lookbehind: replace the ':' of your non-capturing group with '<=', turning (?:...) into (?<=...). Unfortunately, Python's re module does not support variable-length lookbehind patterns, so only a single whitespace character is allowed after the number.

import re

ss = '(1) This is my first sentence, it has to be captured. This is my second sentence.'

re.search(r'(?<=\([0-9]\)\s).*?[.!?]', ss).group(0)

Output:

'This is my first sentence, it has to be captured.'
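
To illustrate the fixed-width restriction, here is a small sketch showing that the standard re module rejects a lookbehind containing \s+ when the pattern is compiled (the variable name error_message is only illustrative):

```python
import re

try:
    # \s+ inside the lookbehind makes its width variable.
    re.compile(r'(?<=\([0-9]\)\s+).*?[.!?]')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```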

You can also search for a group:

re.search(r'\([0-9]\)\s+(.*?[.!?])', ss).group(1)

Output:

'This is my first sentence, it has to be captured.'

This version allows variable-length patterns, such as \s+.

Both options need only minimal changes to your original pattern.
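
The same lookbehind idea could also be applied to the follow-on sentences from the question. This is a sketch under the assumption that every sentence ends in ., ! or ? and the next one starts with a capital letter; the pattern and the longer test string are illustrative and not part of the original question:

```python
import re

ss = ('(1) This is my first sentence, it has to be captured. '
      'This is my second sentence. This is my third sentence.')

# (?<=[.!?]) only checks that a terminator precedes the whitespace; it does not
# consume it, so the dot closing one sentence can still introduce the next.
following = [(m.start(1), m.group(1))
             for m in re.finditer(r'(?<=[.!?])\s+([A-Z].*?[.!?])', ss)]
print(following)
```

Each tuple holds the start index of a sentence and the sentence itself.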

Upvotes: 1

Nick

Reputation: 147146

You can use your existing regex, just placing a group around the sentence part (.*?[.!?]) and getting group 1 from the output of re.search:

import re

para = '(1) This is my first sentence, it has to be captured. This is my second sentence.'
print(re.search(r'\([0-9]\)\s+(.*?[.!?])', para).group(1))

Output:

This is my first sentence, it has to be captured.

Upvotes: 1
