Emre

Reputation: 25

Separating words with Regex (Not in specific order)

I want to split a text into separate items. For example, the following sentence contains several items, each starting with a capital letter followed by a period. How can I separate them?

Text:

A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor

Goal:

A. lorem ipsum dolor sit 
B . 41dipiscing elit sed 
C. lorem ipsum dolor sit amet 
D. 35 Consectetur adipiscing 
E .Sed do eiusmod tempor

What I have tried:

^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$

Result:

https://regex101.com/r/4HB0oD/1

But my regex doesn't detect anything beyond the first sentence. What is the reason for this?

Upvotes: 1

Views: 49

Answers (2)

CAustin

Reputation: 4614

This pattern should do what you're looking for:

[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)

https://regex101.com/r/i92QR1/1
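As for why the original attempt stops after the first item: it is anchored with ^ and $, and its final (.*) is greedy, so one match consumes the entire line. A minimal Python sketch of the difference follows; the variable names are only illustrative, and it assumes the whole text sits on a single line as in the question.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum "
        "dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Original attempt from the question: ^ and $ pin the match to the whole
# line, and the greedy (.*) swallows everything after the first label.
original = re.compile(r'^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$')
original_matches = [m.group(0) for m in original.finditer(text)]

# Pattern from this answer: a label, then a lazy run of characters that
# stops right before the next label (or at the end of the string).
item = re.compile(r'[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)')
items = [m.strip() for m in item.findall(text)]

print(len(original_matches))  # a single match covering the whole line
for line in items:
    print(line)
```

The strip() call only trims the space left at the end of some items; the pattern itself is unchanged.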

Upvotes: 0

Emma

Reputation: 27763

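Maybe a zero-width lookahead such as

(?=[A-Z]\s*\.)

might work OK. It matches the position just before each label, so a newline can be inserted there.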


Test

import re

string = '''
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
'''

print(re.sub(r'(?=[A-Z]\s*\.)', '\n', string))

Output


A. lorem ipsum dolor sit 
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet 
D. 35 Consectetur adipiscing 
E .Sed do eiusmod tempor
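If a list of items is more convenient than one string with newlines, the same lookahead can probably be used with re.split instead. This is only a sketch: it assumes Python 3.7 or later (where splitting on an empty match is supported) and, like the expression above, it assumes every label is a single uppercase letter.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum "
        "dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Split at every position where a label starts. The lookahead is
# zero-width, so no characters are removed from the pieces.
parts = re.split(r'(?=[A-Z]\s*\.)', text)

# Drop the empty piece produced by the split at position 0 and trim
# the space left at the end of some items.
items = [p.strip() for p in parts if p.strip()]

for line in items:
    print(line)
```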


If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can also watch the matching steps, or modify them, in the regex101 debugger, which shows how a regex engine might consume sample input strings step by step while matching.


RegEx Circuit

jex.im visualizes regular expressions.

Upvotes: 2
