Bruno Neves

Reputation: 23

Regex to find paragraph that contains a sentence in a multi-line text

I have text extracted from a PDF that looks like this:

========================================

TITLE

subtitle

Lorem Ipsum is simply dummy text of the printing

and typesetting industry. Lorem Ipsum has been

the industry's standard dummy text ever since the 1500s.

subtitle

Lorem Ipsum is simply dummy text of the printing and

typesetting industry. Lorem Ipsum has been the industry's

standard dummy text ever since the 1500s.

========================================

There is a newline ('\n') at the end of each line.

I am trying to find a given sentence using regex and extract the paragraph in which it was found. A paragraph is any text delimited by two consecutive newlines (\n\n). Note that the matching has to be lazy (non-greedy).

FYI:

  1. The sentence can start on one line and end on another

  2. I cannot change the format of the given text

  3. There is a limit on the number of lines to return: if I can't find \n\n within 10 lines above or below, I have to return the 10 lines before and the 10 lines after the matched keyword

Upvotes: 2

Views: 837

Answers (1)

mrxra

Reputation: 862

Something like this might get you started:

import re

data = """
ggg

aaa aaa aaa
more bla...

========================================

TITLE

subtitle

Lorem Ipsum is simply dummy text of the printing

and typesetting industry. Lorem Ipsum has been

the industry's standard dummy text ever since the 1500s.

subtitle

Lorem Ipsum is simply more bla of the printing and

typesetting industry. Lorem Ipsum has been the industry's

standard dummy text ever since the 1500s.

========================================

bla bla bla bla bla
more bla...

yet more bla
"""

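if __name__ == "__main__":
    to_search = "more bla"
    # Any character except "\n", or a "\n" that is not followed by another "\n"
    # and does not come right after an empty line (^ relies on re.MULTILINE).
    within_paragraph = r"(?:(?<!^\n)\n(?!^\n)|[^\n])*"
    pattern = within_paragraph + re.escape(to_search) + within_paragraph
    print(re.findall(pattern, data, re.DOTALL | re.MULTILINE | re.IGNORECASE))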

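The important parts are the MULTILINE flag and the lookarounds. MULTILINE lets ^ match at the start of every line, which is how the lookarounds detect two successive \n characters and stop the match there. DOTALL has no effect on this particular pattern because it contains no `.`; newlines are matched explicitly with \n. Note that a match can begin with the second \n of a paragraph break, so you may want to strip the results.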
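The remaining constraints from the question (a sentence that wraps onto the next line, and the 10-line cap) could be handled along these lines. This is only a sketch under a few assumptions: `find_paragraph` and `max_lines` are made-up names, a wrapped sentence is assumed to contain at most one newline between two words, and the cap is applied to each side of the match independently, which is one possible reading of the requirement.

```python
import re


def find_paragraph(text, sentence, max_lines=10):
    """Return the paragraph around the first occurrence of `sentence`.

    Hypothetical helper: returns None when the sentence is not found.
    """
    words = sentence.split()
    if not words:
        return None

    # Between two words allow spaces/tabs, or a single line break, so a
    # sentence that continues on the next line still matches, but a
    # paragraph break (\n\n) never does.
    gap = r"(?:[ \t]+|[ \t]*\n[ \t]*)"
    pattern = gap.join(re.escape(word) for word in words)
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return None

    # Paragraph boundaries: the nearest \n\n on each side of the match,
    # falling back to the start/end of the text.
    start = text.rfind("\n\n", 0, match.start())
    start = 0 if start == -1 else start + 2
    end = text.find("\n\n", match.end())
    end = len(text) if end == -1 else end

    # before[-1] and after[0] are the partial lines touching the match;
    # keep at most `max_lines` complete lines beyond them on each side.
    before = text[start:match.start()].split("\n")[-(max_lines + 1):]
    after = text[match.end():end].split("\n")[:max_lines + 1]
    return "\n".join(before) + match.group(0) + "\n".join(after)


if __name__ == "__main__":
    sample = (
        "TITLE\n\n"
        "Lorem Ipsum is simply dummy text of the printing\n"
        "and typesetting industry. Lorem Ipsum has been\n"
        "the industry's standard dummy text.\n\n"
        "subtitle\n"
    )
    print(find_paragraph(sample, "printing and typesetting"))
```

Using plain `str.find`/`str.rfind` for the \n\n boundaries keeps the regex limited to locating the sentence itself, which makes the line cap easy to apply afterwards.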

Upvotes: 1
