Reputation: 776
Consider the following text file.
~~~~~~~~~~~~~~~~~~~~~~~
| |
| First Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Monday 8 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| Second Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Friday 12 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| 3rd Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Friday 19 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| 4th Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
How can I extract the second, third, and fourth blocks and save them based on the date given above them? For example, I need to extract all the lines in the
~~~~~~~~~~~~~~~~~~~~~~~
| |
| Second Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
and then save them into a file or variable named Monday 8 August 2021.
With the following regex I can find the lines that contain the date: https://regex101.com/r/nKW1W4/1
-(?P<date>.*?)-
Upvotes: 1
Views: 675
Reputation:
You can use:
import re
input_text = """
~~~~~~~~~~~~~~~~~~~~~~~
| |
| First Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Monday 8 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| Second Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Friday 12 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| 3rd Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Friday 19 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| 4th Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
"""
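# Splitting on the date lines keeps each captured date in the result list.
result = re.split(r'-+(.*?)-+', input_text)
for k, v in enumerate(result):
    result[k] = v.strip()
print(result)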
A list comprehension, suggested by @fsimonjetz, is a bit more concise:
input_text = """
~~~~~~~~~~~~~~~~~~~~~~~
| |
| First Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Monday 8 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| Second Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Friday 12 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| 3rd Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
----------------------- Friday 19 August 2021 -----------------------
~~~~~~~~~~~~~~~~~~~~~~~
| |
| 4th Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
"""
result = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]
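To get from that flat list to variables keyed by date, the dates and blocks can be paired up. This is only a minimal sketch: it assumes the date lines are the only lines containing hyphens, it uses a shortened sample in place of the full text, and the name blocks_by_date is made up for illustration.

```python
import re

# Shortened sample in the same layout as the question's file (an assumption).
input_text = """
~~~~~
| First Block of text |
~~~~~
----------------------- Monday 8 August 2021 -----------------------
~~~~~
| Second Block of text |
~~~~~
----------------------- Friday 12 August 2021 -----------------------
~~~~~
| 3rd Block of text |
~~~~~
"""

parts = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]

# parts[0] is the undated first block; after it, dates and blocks alternate.
blocks_by_date = dict(zip(parts[1::2], parts[2::2]))

for date, block in blocks_by_date.items():
    print(date)
    print(block)
```

The odd indexes of the split result hold the captured dates and the even indexes from 2 onward hold the blocks that follow them, which is why the two slices line up.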
Upvotes: 3
Reputation: 163207
In your pattern you are only matching a single - at the left and right, and .*? matches 0+ chars other than a newline, non-greedily.
That will give you a lot of partial matches instead of matching the whole line.
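The partial matches are easy to see on a single separator line. The sketch below uses a shortened line (three hyphens per side instead of the full run, purely for readability) and compares the question's pattern with a variant that matches runs of hyphens; the variable names are illustrative.

```python
import re

line = "--- Monday 8 August 2021 ---"

# The question's pattern: one hyphen, a lazy capture, one hyphen.
single = re.findall(r"-(?P<date>.*?)-", line)

# Runs of hyphens on both sides, so the whole separator line is consumed.
runs = re.findall(r"-+(?P<date>.*?)-+", line)

print(single)  # ['', ' Monday 8 August 2021 ', '']
print(runs)    # [' Monday 8 August 2021 ']
```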
You might also use a match, and use capture group 1 for the filename and capture group 2 for the data.
^-+([^-]+)-+((?:\n(?!--).*)*)
Explanation
^  Start of the line (the example below uses re.M)
-+  Match - one or more times
([^-]+)  Capture group 1 for the date part: match one or more chars other than -
-+  Match - one or more times
(  Capture group 2 for the data part
(?:\n(?!--).*)*  Match all following lines that do not start with --
)  Close group 2
For example
import re
pattern = r"^-+([^-]+)-+((?:\n(?!--).*)*)"
s = (" ~~~~~~~~~~~~~~~~~~~~~~~\n"
"| |\n"
"| First Block of text |\n"
"| |\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
"----------------------- Monday 8 August 2021 -----------------------\n\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n"
"| |\n"
"| Second Block of text |\n"
"| |\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
"----------------------- Friday 12 August 2021 -----------------------\n\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n"
"| |\n"
"| 3rd Block of text |\n"
"| |\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n"
" \n"
"----------------------- Friday 19 August 2021 -----------------------\n\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n"
"| |\n"
"| 4th Block of text |\n"
"| |\n"
" ~~~~~~~~~~~~~~~~~~~~~~~\n")
matches = re.findall(pattern, s, re.M)
if matches:
    # Each match is a (date, block) tuple; show the first one.
    filename = matches[0][0].strip()
    data = matches[0][1].strip()
    print(filename)
    print(data)
Output
Monday 8 August 2021
~~~~~~~~~~~~~~~~~~~~~~~
| |
| Second Block of text |
| |
~~~~~~~~~~~~~~~~~~~~~~~
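The example above prints only the first match. To save every block under its date, the same pattern can be looped over with re.finditer. This is a sketch under a few assumptions that are not in the question: a shortened sample string, a .txt suffix for the file names, and a temporary directory as the output location.

```python
import re
import tempfile
from pathlib import Path

pattern = r"^-+([^-]+)-+((?:\n(?!--).*)*)"

# Shortened stand-in for the question's file (assumed layout).
s = (
    "| First Block of text |\n"
    "----------------------- Monday 8 August 2021 -----------------------\n"
    "| Second Block of text |\n"
    "----------------------- Friday 12 August 2021 -----------------------\n"
    "| 3rd Block of text |\n"
)

# Hypothetical output location; a temporary directory avoids touching real files.
out_dir = Path(tempfile.mkdtemp())

written = []
for match in re.finditer(pattern, s, re.M):
    filename = match.group(1).strip() + ".txt"  # the .txt suffix is an assumption
    data = match.group(2).strip()
    (out_dir / filename).write_text(data + "\n", encoding="utf-8")
    written.append(filename)

print(written)
```

The first block has no date line above it, so it is never matched and no file is written for it.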
Upvotes: 1
Reputation: 43169
You may match your blocks with the following expression and use the first group as the filename:
^
-+([^-]+)-+$
(.+?(?=^--|\Z))
See a demo on regex101.com (and mind the modifiers).
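In Python the expression could be used roughly as follows. The flags are an assumption inferred from how the pattern is laid out, not taken from the demo: VERBOSE because the pattern spans several lines, MULTILINE so that ^ and $ work per line, and DOTALL so that . can cross newlines. The sample string and the name blocks are illustrative.

```python
import re

# Flags are assumed from the pattern's layout (see the note above).
pattern = re.compile(
    r"""
    ^
    -+([^-]+)-+$
    (.+?(?=^--|\Z))
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

# Shortened stand-in for the question's file (assumed layout).
s = (
    "| First Block of text |\n"
    "----------------------- Monday 8 August 2021 -----------------------\n"
    "| Second Block of text |\n"
    "----------------------- Friday 12 August 2021 -----------------------\n"
    "| 3rd Block of text |\n"
)

# Group 1 is the date (usable as a filename), group 2 is the block below it.
blocks = {m.group(1).strip(): m.group(2).strip() for m in pattern.finditer(s)}

for date, block in blocks.items():
    print(date)
    print(block)
```

The lazy .+? stops at the next line that starts with two hyphens, or at the end of the string for the last block.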
Upvotes: 3