Reputation: 25
Extracting from text
For example, the following text contains several items, each starting with a capital letter followed by a dot. How can I separate them?
Text:
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
Goal:
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
What I have tried:
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
Result:
https://regex101.com/r/4HB0oD/1
But my regex only detects the first sentence and none of the ones after it. What is the reason for this?
Upvotes: 1
Views: 49
Reputation: 4614
This pattern should do what you're looking for:
[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)
https://regex101.com/r/i92QR1/1
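As a rough sketch of how this could be used from Python (the question does not name a language, so Python and the variable names `text`, `anchored`, `item`, and `parts` are assumptions for illustration): the pattern in the question starts with `^` and ends with a greedy `(.*)$`, so its single match swallows the whole line. The pattern above is unanchored and lazy, so it stops just before the next label.

```python
import re

# Sample text from the question, written as one line.
text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor "
        "sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Pattern from the question: anchored at ^, and the greedy (.*)$ consumes
# the rest of the line, so at most one match per line is possible.
anchored = re.compile(r'^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$')
anchored_matches = list(anchored.finditer(text))

# Pattern from this answer: a label, then as few characters as possible
# up to the next label or the end of the string.
item = re.compile(r'[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)')
parts = [m.strip() for m in item.findall(text)]

print(len(anchored_matches))  # number of matches for the question's pattern
for part in parts:
    print(part)
```

Stripping each match removes the space left before the next label. Note that `.` does not match a newline by default, so multi-line input would need to be handled line by line or with different flags.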
Upvotes: 0
Reputation: 27763
Maybe,
(?=[A-Z]\s*\.)
might work OK.
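import re

# Sample input from the question.
string = '''
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
'''

# Insert a newline before every capital letter that is followed by
# optional whitespace and a dot (the lookahead consumes no characters).
print(re.sub(r'(?=[A-Z]\s*\.)', '\n', string))
Output: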
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
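If a list of items is more convenient than one newline-separated string, the same lookahead could also be passed to `re.split`. This is only a sketch: it assumes Python 3.7 or later, where `re.split` can split on empty matches, and the names `text`, `pieces`, and `items` are made up for illustration.

```python
import re

# Sample text from the question, written as one line.
text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor "
        "sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Split at every position that is followed by a capital letter,
# optional whitespace, and a dot. The lookahead matches an empty string,
# so no characters are removed from the text.
pieces = re.split(r'(?=[A-Z]\s*\.)', text)

# Drop the empty piece produced by the match at position 0 and
# trim the spaces left around each item.
items = [p.strip() for p in pieces if p.strip()]

for entry in items:
    print(entry)
```

Because the lookahead only accepts `[A-Z]`, a digit label such as `1.` would not start a new item here; the character class would have to be widened for that.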
If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can also watch the matching steps or modify them in this debugger link, if you are interested. The debugger demonstrates how a regex engine might consume some sample input strings step by step and perform the matching process.
jex.im visualizes regular expressions.
Upvotes: 2