Reputation: 859
I am trying to match paragraphs using Python's re module.
An example of a text:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
two or more line breaks here
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
two or more line breaks here
Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
This expression almost does the job:
paragraphs = re.findall(r'(?s)((?:[^\n][\n]?)+)', textContent)
But I want a paragraph to end only where there are two or more line breaks in a row. Currently the expression splits too often.
Edit:
ART. WEFWEFEW
1 SDVSDRG: **<at the moment it breaks here, but it shouldn't>**
a. wevvdfvdfd
b. sdfsdfsdfsdfsdfsdghtrhrth
Edit2:
ART. WEFWEFEW
1 SDVSDRG:
**here are two line breaks, but don't split this paragraph**
**at the moment it breaks here, but it shouldn't**
a. wevvdfvdfd
b. sdfsdfsdfsdfsdfsdghtrhrth
Upvotes: 1
Views: 182
Reputation: 3439
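Try the regex (?m)(?:.+(?:\n.)?)+
on RegEx101, where you can also get an explanation of it. It matches a run of non-empty lines joined by single line breaks, so a match ends wherever two or more line breaks occur in a row.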
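To make the pieces of that pattern easier to follow, here is a sketch of the same expression written in verbose form with comments. The names PARAGRAPH and sample are made up for illustration. The (?m) flag is left out here because the pattern contains no ^ or $ anchors, so the flag does not change what it matches.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
PARAGRAPH = re.compile(r"""
    (?:            # one "line unit" of the paragraph, repeated one or more times:
        .+         #   one or more characters other than a newline
        (?:\n.)?   #   optionally a single newline plus the first character of the next line
    )+
""", re.VERBOSE)

# Illustrative input: two lines, a blank line, then another paragraph.
sample = "first line\nsecond line\n\nnext paragraph"

paragraphs = PARAGRAPH.findall(sample)
print(paragraphs)
```

Because the optional part requires a non-newline character right after the newline, the match cannot continue across an empty line.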
Sample Python code that uses this regex:
import re
import pprint
textContent = '''Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et
accusam et justo duo dolores et ea rebum.

Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet.

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna
aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren, no
sea takimata sanctus est Lorem ipsum dolor sit amet.

ART. WEFWEFEW
 1 SDVSDRG:
 a. wevvdfvdfd
 b. sdfsdfsdfsdfsdfsdghtrhrth'''
pprint.pprint(re.findall(r'(?m)(?:.+(?:\n.)?)+', textContent))
Output:
['Lorem ipsum dolor sit amet, consetetur sadipscing elitr,\n'
 'sed diam nonumy eirmod tempor invidunt ut labore et dolore\n'
 'magna aliquyam erat, sed diam voluptua. At vero eos et\n'
 'accusam et justo duo dolores et ea rebum.',
 'Stet clita kasd gubergren, no sea takimata sanctus est Lorem\n'
 'ipsum dolor sit amet.',
 'Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam\n'
 'nonumy eirmod tempor invidunt ut labore et dolore magna\n'
 'aliquyam erat, sed diam voluptua. At vero eos et accusam et\n'
 'justo duo dolores et ea rebum. Stet clita kasd gubergren, no\n'
 'sea takimata sanctus est Lorem ipsum dolor sit amet.',
 'ART. WEFWEFEW\n'
 ' 1 SDVSDRG:\n'
 ' a. wevvdfvdfd\n'
 ' b. sdfsdfsdfsdfsdfsdghtrhrth']
Demo on Rextester.
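As a possible alternative, here is a minimal sketch that splits on blank lines with re.split instead of matching paragraphs with re.findall. The function names split_paragraphs and merge_after_colon are hypothetical. The second function is only a guess at what Edit2 asks for: it assumes a chunk should be rejoined to the previous one when that previous chunk ends with a colon, which the question does not state explicitly.

```python
import re

def split_paragraphs(text):
    """Split text wherever two or more line breaks occur in a row.

    Lines containing only whitespace are treated as blank lines too.
    """
    parts = re.split(r'\n\s*\n', text.strip())
    return [p for p in parts if p]

def merge_after_colon(paragraphs):
    """Hypothetical rule (an assumption, not from the question):
    rejoin a chunk to the previous one if the previous one ends with ':'."""
    merged = []
    for chunk in paragraphs:
        if merged and merged[-1].rstrip().endswith(':'):
            # Restore the blank line that the split removed.
            merged[-1] = merged[-1] + '\n\n' + chunk
        else:
            merged.append(chunk)
    return merged

# Illustrative input modelled on the Edit2 example.
sample = "ART. WEFWEFEW\n 1 SDVSDRG:\n\n a. wevvdfvdfd\n b. sdfsdf\n\nNext paragraph."

result = merge_after_colon(split_paragraphs(sample))
print(result)
```

If the colon rule does not fit the real data, split_paragraphs can be used on its own and the merge step replaced with whatever condition actually marks a paragraph that must stay together.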
Upvotes: 0