Dominik Scheld

Reputation: 125

Python Regex - Extract text between (multiple) expressions in a textfile

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.

I want to extract all text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to run this on a number of files; below is an example of what such a text file looks like. Here I want to extract all text from "Dear" to "Douglas". If no letter_end expression is found, the output should start at the letter_begin match and run to the very end of the text file.

Edit: the extracted text has to end after the match of "letter_end" and before the first line with 20 characters or more (as is the case for "Random text here as well", len=24).

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

This is my code so far, but it does not flexibly capture the text between the expressions (there can be anything, such as lines, text, numbers or signs, before the "letter_begin" and after the "letter_end"):

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()  # read() already returns a str
    # record all text between the beginning and end expressions
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE)
    print(output)

Thank you very much for any help!

Upvotes: 1

Views: 1715

Answers (1)

Wiktor Stribiżew

Reputation: 626689

You may use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern will result in a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

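Note that you should not use re.DOTALL with this pattern: with that flag, .* would also match newlines and run to the end of the text. The re.MULTILINE option is redundant as well, since the pattern contains no ^ or $ anchors.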
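To illustrate the remark about re.DOTALL, here is a small sketch. The sample string and the shortened keyword lists are made up for illustration and are not taken from the question's files.

```python
import re

# Hypothetical sample: a short letter followed by three unrelated lines.
text = "Dear all\nbody\nBest regards\nDouglas\n\nline 1\nline 2\nline 3"
regex = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without re.DOTALL, each .* stops at the end of its line.
without_dotall = re.findall(regex, text, re.IGNORECASE)

# With re.DOTALL, the first .* also matches newlines and swallows the rest of the text.
with_dotall = re.findall(regex, text, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall)
```

In this sample the first call stops after the signature line, while the second returns everything up to "line 3".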

Details

  • (?:dear|to our|estimated) - any of the three values
  • [\s\S]*? - any 0+ chars, as few as possible
  • (?:sincerely|yours|best regards) - any of the three values
  • .* - any 0+ chars other than newline
  • (?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ chars other than newline (this is what picks up the signature line after the closing expression).
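If the keyword lists may contain regex metacharacters, or if a keyword should not match inside a longer word (for example "dear" inside "endeared"), the pattern can be built a little more defensively. The following is only a sketch: the helper name build_letter_regex and the lookarounds (?<!\w) and (?!\w) are additions for illustration, not part of the pattern above.

```python
import re

def build_letter_regex(letter_begin, letter_end):
    """Hypothetical helper: build the pattern from plain-text keyword lists."""
    # re.escape makes characters such as "." match literally.
    openings = "|".join(re.escape(s) for s in letter_begin)
    closings = "|".join(re.escape(s) for s in letter_end)
    # (?<!\w) and (?!\w) reject matches that sit inside a longer word.
    return (r"(?<!\w)(?:{})(?!\w)[\s\S]*?"
            r"(?<!\w)(?:{})(?!\w).*(?:\n.*){{0,2}}").format(openings, closings)

regex = build_letter_regex(["dear"], ["yours"])
sample = "Endeared to you. Dear X\nhi\nYours\nBob"
print(re.findall(regex, sample, re.IGNORECASE))
```

With the sample above, the match starts at "Dear X" rather than inside "Endeared".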

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
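The question also asks for two things the pattern above does not cover: falling back to the end of the file when no closing expression is found, and stopping before the first line of 20 or more characters instead of after a fixed two lines. One possible variation is sketched below. The function name extract_letters and its long_line parameter are hypothetical, and the shortened sample text is an assumption for illustration.

```python
import re

def extract_letters(text, letter_begin, letter_end, long_line=20):
    """Hypothetical helper: return every letter found in text."""
    openings = "|".join(letter_begin)
    closings = "|".join(letter_end)
    # After the closing line, keep following lines only while they are
    # shorter than long_line characters.
    short_lines = r"(?:\n(?!.{%d}).*)*" % long_line
    # Either stop at a closing expression, or (\Z) run to the end of the text.
    regex = r"(?:%s)[\s\S]*?(?:(?:%s).*%s|\Z)" % (openings, closings, short_lines)
    return re.findall(regex, text, re.IGNORECASE)

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]

sample = ("Some random text here\n\nDear Shareholders We\nare pleased.\n"
          "Best regards \nDouglas\n\nRandom text here as well")
print(extract_letters(sample, letter_begin, letter_end))

# No closing expression: the match runs to the end of the text.
print(extract_letters("intro\nDear all\nno closing here", letter_begin, letter_end))
```

The negative lookahead (?!.{20}) is what rejects a following line once it reaches 20 characters; lines between the opening and the closing expression are not restricted in length.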

Upvotes: 1
