sci9

Reputation: 776

Python regex: Split a large text file into smaller parts

Consider the following text file.

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

How can I extract the second, third, and fourth blocks and save them based on the date given above them? For example, I need to extract all the lines in the

     ~~~~~~~~~~~~~~~~~~~~~~~
    |                       |
    | Second Block of text  |
    |                       |
     ~~~~~~~~~~~~~~~~~~~~~~~

and then save it into a file or variable with the name Monday 8 August 2021.

With the following regex I can find the lines that contain the date: https://regex101.com/r/nKW1W4/1

-(?P<date>.*?)-

Upvotes: 1

Views: 675

Answers (3)

user16638334

You can use:

input_text = """
 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
"""

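import re

# Split on the dashed date lines; the capture group keeps each date in the result.
a = re.split(r'-+(.*?)-+', input_text)

# Strip surrounding whitespace from every piece.
for k, v in enumerate(a):
    a[k] = v.strip()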

print(a)

A list comprehension, suggested by @fsimonjetz, is a bit more concise. It uses the same input_text as above:

result = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]
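
The split result alternates between blocks and dates, so it can be paired up into a mapping from each date to the block that follows it. The following is a minimal sketch under that assumption; the helper name `split_by_date` and the shortened `sample` text are made up for illustration and are not part of the answer above.

```python
import re

def split_by_date(text):
    """Return {date: block} for every block that follows a dashed date line."""
    parts = [p.strip() for p in re.split(r'-+(.*?)-+', text)]
    # parts looks like: [block_before_first_date, date_1, block_1, date_2, block_2, ...]
    # Odd indices are dates, the even indices after them are the matching blocks.
    return dict(zip(parts[1::2], parts[2::2]))

# Shortened stand-in for the real file contents (hypothetical data).
sample = (
    "intro block\n"
    "\n"
    "----- Monday 8 August 2021 -----\n"
    "\n"
    "second block\n"
    "\n"
    "----- Friday 12 August 2021 -----\n"
    "\n"
    "third block\n"
)

blocks = split_by_date(sample)
for date, block in blocks.items():
    print(date, '->', repr(block))
```

Note that this split pattern is not anchored to whole lines, so a block line that itself contains two or more hyphens would also be treated as a separator.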

Upvotes: 3

The fourth bird

Reputation: 163207

Your pattern matches only a single - on the left and on the right, and .*? matches zero or more characters other than a newline, non-greedily.

That will give you a lot of partial matches instead of matching the whole line.
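
A quick way to see the partial matches is to run the original pattern against a single date line. This is only an illustrative sketch: the `strict` pattern is one possible anchored variant chosen for comparison, not the pattern proposed in this answer.

```python
import re

# One date line as it appears in the file: 23 dashes, the date, 23 dashes.
line = "-" * 23 + " Monday 8 August 2021 " + "-" * 23

# Original pattern: a single '-' on each side with a lazy match in between.
# Adjacent dashes pair up, which yields many empty captures.
loose = re.findall(r"-(?P<date>.*?)-", line)

# Hypothetical anchored variant: runs of '-' on both sides, whole line matched.
strict = re.findall(r"^-+\s*(?P<date>[^-]+?)\s*-+$", line, re.M)

print(len(loose), "matches, of which", loose.count(""), "are empty")
print(strict)
```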


You might also use a match, and use capture group 1 for the filename and capture group 2 for the data.

^-+([^-]+)-+((?:\n(?!--).*)*)

Explanation

  • ^ Start of the string, or of each line when re.M is used as in the code below
  • -+ Match 1+ times -
  • ([^-]+) Capture group 1 for the date part, match all chars except -
  • -+ Match 1+ times -
  • ( Capture group 2 for the data part
    • (?:\n(?!--).*)* Match every following line that does not start with --
  • ) Close group 2

Regex demo

For example

import re

pattern = r"^-+([^-]+)-+((?:\n(?!--).*)*)"

s = (" ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| First Block of text   |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| Second Block of text  |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| 3rd Block of text     |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    " \n"
    "----------------------- Friday 19 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                      |\n"
    "| 4th Block of text     |\n"
    "|                      |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n")

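# findall returns a list of (date, data) tuples, one per dated block.
matches = re.findall(pattern, s, re.M)
if matches:
    # Show only the first dated block.
    filename = matches[0][0].strip()
    data = matches[0][1].strip()
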
    print(filename)
    print(data)

Output

Monday 8 August 2021
~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
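
To handle every block rather than only the first, the same pattern can be applied with re.finditer and each block written to a file named after its date. This is a sketch, not a definitive implementation: the helper `save_blocks`, the `.txt` suffix, and the shortened `sample` text are assumptions made for illustration, and the demo writes into a temporary directory.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"^-+([^-]+)-+((?:\n(?!--).*)*)", re.M)

def save_blocks(text, out_dir):
    """Write each dated block to <out_dir>/<date>.txt and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for m in PATTERN.finditer(text):
        date = m.group(1).strip()
        data = m.group(2).strip()
        path = out_dir / f"{date}.txt"  # hypothetical naming scheme
        path.write_text(data + "\n", encoding="utf-8")
        written.append(path)
    return written

# Shortened stand-in for the real file contents (hypothetical data).
sample = (
    "first\n\n"
    "----- Monday 8 August 2021 -----\n\n"
    "second\n\n"
    "----- Friday 12 August 2021 -----\n\n"
    "third\n"
)

with tempfile.TemporaryDirectory() as tmp:
    paths = save_blocks(sample, tmp)
    names = [p.name for p in paths]
    contents = [p.read_text(encoding="utf-8") for p in paths]

print(names)
```

Calling `save_blocks(s, "blocks")` with the string `s` from the example above would write one file per date into a `blocks` directory.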

Upvotes: 1

Jan

Reputation: 43169

You may match your blocks with the following expression and use the first group as the filename:

^
-+([^-]+)-+$
(.+?(?=^--|\Z))

See a demo on regex101.com (and mind the modifiers).
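
In Python, the expression above could be used roughly as follows. The flag choice is an assumption about what "the modifiers" refers to: re.VERBOSE lets the pattern span several lines, re.MULTILINE makes ^ and $ match at line boundaries, and re.DOTALL lets . cross newlines. The name `BLOCK_RE` and the shortened `sample` text are made up for illustration.

```python
import re

BLOCK_RE = re.compile(
    r"""
    ^
    -+([^-]+)-+$        # dashed line; group 1 is the date
    (.+?(?=^--|\Z))     # group 2: everything up to the next dashed line or the end
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

# Shortened stand-in for the real file contents (hypothetical data).
sample = (
    "first\n\n"
    "----- Monday 8 August 2021 -----\n\n"
    "second\n\n"
    "----- Friday 12 August 2021 -----\n\n"
    "third\n"
)

# Map each date to the block that follows it.
blocks = {
    m.group(1).strip(): m.group(2).strip()
    for m in BLOCK_RE.finditer(sample)
}

for date, block in blocks.items():
    print(date, '->', repr(block))
```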

Upvotes: 3
