Reputation: 23
I have text extracted from a PDF that looks like this:
========================================
TITLE
subtitle
Lorem Ipsum is simply dummy text of the printing
and typesetting industry. Lorem Ipsum has been
the industry's standard dummy text ever since the 1500s.
subtitle
Lorem Ipsum is simply dummy text of the printing and
typesetting industry. Lorem Ipsum has been the industry's
standard dummy text ever since the 1500s.
========================================
There is a newline ('\n') at the end of each line.
I am trying to find a given sentence using regex and extract the paragraph in which it was found. A paragraph is anything between two consecutive newlines (\n\n). Note that it has to be done with lazy matching.
FYI:
The sentence can start on one line and end on another.
I cannot change the given text format.
There is a limit on the number of lines to return: if I can't find \n\n within 10 lines above or below, I have to return the 10 lines before and the 10 lines after the regex keyword.
Upvotes: 2
Views: 837
Reputation: 862
Something like this might get you started:
import re
data = """
ggg
aaa aaa aaa
more bla...
========================================
TITLE
subtitle
Lorem Ipsum is simply dummy text of the printing
and typesetting industry. Lorem Ipsum has been
the industry's standard dummy text ever since the 1500s.
subtitle
Lorem Ipsum is simply more bla of the printing and
typesetting industry. Lorem Ipsum has been the industry's
standard dummy text ever since the 1500s.
========================================
bla bla bla bla bla
more bla...
yet more bla
"""
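if __name__ == "__main__":
    to_search = "more bla"
    # One paragraph character: any non-newline, or a newline that is neither
    # preceded nor followed by an empty line (so the match stops at "\n\n").
    para = r"(?:(?<!^\n)\n(?!^\n)|[^\n])*"
    pattern = para + re.escape(to_search) + para
    print(re.findall(pattern, data, re.DOTALL | re.MULTILINE | re.IGNORECASE))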
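The MULTILINE flag is the important one: it lets ^ match at the start of every line, which the lookbehind and lookahead rely on to detect two successive \n characters, so a match never crosses a blank line. DOTALL only affects ., which this pattern does not use, so it is harmless here but not required. IGNORECASE makes the search case-insensitive.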
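The regex above does not cover two points from the question: a sentence that wraps onto the next line, and the 10-line cap when no blank line is nearby. Below is a minimal sketch of one way to handle both. It is not the only approach, and it swaps the single-regex technique for a line-based walk. The function name find_paragraph and its max_lines parameter are made up for illustration, and it assumes a line containing only whitespace counts as a paragraph break. The walk stops at the nearest blank line on each side, which plays the role of a lazy match.

```python
import re


def find_paragraph(text, sentence, max_lines=10):
    """Return the paragraph containing `sentence`, or None if it is absent.

    Hypothetical helper: expands from the matched lines to the nearest blank
    line on each side, but never more than `max_lines` lines either way.
    """
    words = [re.escape(w) for w in sentence.split()]
    if not words:
        return None
    # Between words allow spaces/tabs or exactly one line break, so a wrapped
    # sentence is found but a blank line is never crossed.
    gap = r"(?:[ \t]+|[ \t]*\n[ \t]*)"
    match = re.search(gap.join(words), text, re.IGNORECASE)
    if match is None:
        return None

    lines = text.split("\n")
    # Line index of an offset = number of newlines before it.
    first = text.count("\n", 0, match.start())
    last = text.count("\n", 0, match.end())

    # Walk upwards until a blank line (assumed paragraph break) or the cap.
    start = first
    while start > 0 and first - start < max_lines and lines[start - 1].strip():
        start -= 1
    # Walk downwards under the same rules.
    end = last
    while end < len(lines) - 1 and end - last < max_lines and lines[end + 1].strip():
        end += 1
    return "\n".join(lines[start:end + 1])


sample = (
    "TITLE\n\n"
    "subtitle\n"
    "Lorem Ipsum is simply dummy text of the printing\n"
    "and typesetting industry. Lorem Ipsum has been\n"
    "the industry's standard dummy text ever since the 1500s.\n\n"
    "subtitle\n"
    "other paragraph\n"
)
print(find_paragraph(sample, "printing and typesetting"))
```

The search phrase is written with a plain space, yet it still matches across the line break between "printing" and "and". If the text has no blank line within max_lines of the match, the result is simply the matched lines plus max_lines lines on each side.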
Upvotes: 1