DanielTheRocketMan

Reputation: 3249

Split the text into paragraphs

I know that I can use something like this

theText='She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
paragraphs = [p for p in theText.split('\n\n') if p]
for i,p in enumerate(paragraphs):
    print(i,p)

to split theText into paragraphs.

However, I would like to add a condition: a new paragraph must not start with a lowercase letter. The current code prints

0 She loves music. Her favorit instrument is the piano.
1  However, 
2  she does not play it.

I would like

0 She loves music. Her favorit instrument is the piano.
1  However, she does not play it.

I believe I need a regex for this, but I could not figure out the right pattern.

Upvotes: 0

Views: 60

Answers (1)

sacuL

Reputation: 51335

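You can split on the following regex, which matches \n\n plus an optional whitespace character only when a capital letter follows; the lookahead (?=[A-Z]) checks for the capital without consuming it. Any \n\n followed by a lowercase letter is not a split point, so it stays inside its paragraph, and you have to strip it when printing (here, using re.sub):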

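import re
# Raw strings keep \s from being read as an (invalid) string escape.
# Split on a blank line only when a capital letter follows (after an optional space).
paragraphs = re.split(r'\n\n\s?(?=[A-Z])', theText)
for i, p in enumerate(paragraphs):
    # Remove the blank lines that remain inside a paragraph.
    print(i, re.sub(r'\n\n\s?', '', p))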

0 She loves music. Her favorit instrument is the piano.
1 However, she does not play it.
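
If you want to reuse this, here is a minimal sketch that wraps the same lookahead idea in a helper. The name split_paragraphs is made up for illustration, and two choices are my own assumptions rather than part of the question: it allows any amount of whitespace after the blank line (\s* instead of \s?), and it joins the pieces of a paragraph with a single space. Note that [A-Z] only covers ASCII capitals, so a paragraph starting with a letter like "É" would not be treated as a new paragraph.

```python
import re

def split_paragraphs(text):
    """Split on blank lines, but only where the next paragraph starts with A-Z.

    Hypothetical helper, not from the original answer.
    """
    # Split at "\n\n" (plus any following whitespace) only if a capital letter comes next.
    parts = re.split(r'\n\n\s*(?=[A-Z])', text)
    # Breaks that were not split points sit inside a paragraph:
    # collapse them, with their surrounding whitespace, into one space (an assumption).
    return [re.sub(r'\s*\n\n\s*', ' ', p).strip() for p in parts if p.strip()]

theText = 'She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
for i, p in enumerate(split_paragraphs(theText)):
    print(i, p)
```

For the sample text this prints the same two paragraphs as above, without the stray double space that simple deletion of \n\n can leave behind.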

Upvotes: 1
