I have a text file that contains blocks in the following format:
...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
---this repeats in txt file----
There are many such blocks, and they appear at different places in the text file. I want to extract only the values that appear between MY TEST MATRIX (ROWS) and MY TEST END, and between MY TEST END and STOP, into separate arrays; let's call them firstValue[] and secondValue[].
For me, one block is "MY TEST MATRIX ... MY TEST END ... STOP".
With simple code like the following I can read one block of data from the text file. However, since the blocks repeat in my file, I do not know how to capture the data from each block separately in the two arrays above.
import os
import sys
from math import *

firstValue = []
secondValue = []
checkFirst = False
checkSecond = False
filename = "r3dmdtr2.txt"
with open(filename, "r") as infile:
    for line in infile:
        if line.strip().startswith("MY TEST MATRIX (ROWS)"):
            checkFirst = True
        if line.strip().startswith("MY TEST END"):
            checkFirst = False
            checkSecond = True
        if line.strip().startswith("STOP"):
            checkSecond = False
        if checkFirst:
            firstValue.append(line)
        if checkSecond:
            secondValue.append(line)
print(firstValue)
print(secondValue)
The above fragment reads one block of data perfectly. How can I parse all the repeating blocks in my text file and append each one as an individual array to firstValue[] and secondValue[]?
Something like:
firstValue = [[values from first block], [values from second block], ...]
secondValue = [[values from first block], [values from second block], ...]
Upvotes: 2
Views: 739
Reputation: 103754
Given:
$ cat file.txt
...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
---this repeats in txt file----
In sed, perl or awk you have the concept of a range regex to do something along the lines of:
$ sed -nE '/^MY TEST MATRIX/,/^MY TEST END/p' file.txt
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
You can replicate this functionality in Python with a FlipFlop class:
class FlipFlop:
    ''' Class to imitate the behavior of /start/, /end/ flip flop in awk '''
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        # Start and end both match on the same line: emit it and stay off
        if all(m for m in ms):
            self.state = False
            return True
        rtr = True if self.state else False
        # ms[False] is the start match, ms[True] is the end match
        if ms[self.state]:
            self.state = not self.state
        # Return True for the end line as well (rtr remembers the prior state)
        return self.state or rtr
Then capture the blocks as you read the file line-by-line:
import re

fn = 'file.txt'
di = {}
blocks = [FlipFlop(re.compile(r'^MY TEST MATRIX \(ROWS\)'), re.compile(r'^MY TEST END')),
          FlipFlop(re.compile(r'^MY TEST END'), re.compile(r'^STOP'))]
# One pass over the file per FlipFlop
for i, ff in enumerate(blocks):
    with open(fn) as f:
        di[i] = [line.strip() for line in f if ff(line)]
Result:
>>> di
{0: ['MY TEST MATRIX (ROWS)',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'MY TEST END',
'MY TEST MATRIX (ROWS)',
'2E+04 2E+04 0.8687E-03',
'2E+04 2E+04 0.8687E-03',
'MY TEST END'],
1: ['MY TEST END',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'0.3776E+03 0.8687E-03 0.1975E-04',
'STOP',
'MY TEST END',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'STOP']}
This reads the file twice, once per FlipFlop, rather than holding the whole file in memory; if speed is more important, you can read the file into memory once and iterate over that.
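Note that the result above is one flat list per range and still contains the marker lines. If you want one nested list per block, as the question asks, a single-pass state machine is one way to get there. This is only a sketch, not part of the FlipFlop approach: the function name parse_blocks is made up for illustration, and it assumes every block is well formed (MATRIX, then END, then STOP) and that every data line holds whitespace-separated numbers that float() accepts.

```python
def parse_blocks(lines):
    """Return (first_values, second_values): one list of float rows per block."""
    first_values, second_values = [], []
    state = None  # None: outside a block; "first" / "second": inside one
    for raw in lines:
        line = raw.strip()
        if line.startswith("MY TEST MATRIX (ROWS)"):
            first_values.append([])   # open a new sub-list for this block
            state = "first"
        elif line.startswith("MY TEST END"):
            second_values.append([])  # second half of the same block begins
            state = "second"
        elif line.startswith("STOP"):
            state = None              # block finished, ignore lines until the next one
        elif state == "first" and line:
            first_values[-1].append([float(x) for x in line.split()])
        elif state == "second" and line:
            second_values[-1].append([float(x) for x in line.split()])
    return first_values, second_values


# Sample text copied from the question
sample = """...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
"""

first_values, second_values = parse_blocks(sample.splitlines())
print(first_values)
print(second_values)
```

For a real file you would pass the open file object instead of sample.splitlines(), since the function only needs an iterable of lines. The marker lines themselves are never appended because each one is consumed by its own branch.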
Upvotes: 0
Reputation: 12005
You can use re.findall:
>>> import re
>>> data = open('file.txt').read()
>>> blocks = re.findall(r'MY TEST MATRIX \(ROWS\)\s*(.*?)\s+MY TEST END\s*(.*?)\s+STOP', data, re.DOTALL)
>>> first, second = zip(*blocks)
>>> print (first)
('2X+00 2X+00 1X+00 \n 2X+00 2X+00 1K+00', '2P+00 2X+00 1M+00 \n 2X+00 2Z+00 1K+00')
>>> print (second)
('2Y+00 2Y+00 1E+00 \n 2Y+00 2Z+00 1E+00', '2Y+00 2Y+00 1E+00 \n 2Y+00 2Z+00 1E+00')
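The captured groups are still raw strings with embedded newlines. If you need numbers, each group can be split into rows afterwards. The following is a rough sketch using the same pattern; the helper names to_rows and extract are invented here, and it assumes the values are valid floats like the ones in the question's sample (which is used as the input below).

```python
import re

# Same pattern as above, compiled once
PATTERN = re.compile(
    r'MY TEST MATRIX \(ROWS\)\s*(.*?)\s+MY TEST END\s*(.*?)\s+STOP',
    re.DOTALL,
)


def to_rows(text):
    """Turn one captured group into a list of float rows, skipping blank lines."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def extract(data):
    """Return (first, second): one list of rows per block for each section."""
    blocks = PATTERN.findall(data)
    first = [to_rows(a) for a, _ in blocks]
    second = [to_rows(b) for _, b in blocks]
    return first, second


# Sample text copied from the question
sample = (
    "...some lines before this...\n"
    "MY TEST MATRIX (ROWS)\n"
    "0.5056E+03 0.8687E-03 -0.1202E-02\n"
    "0.5056E+03 0.8687E-03 -0.1202E-02\n"
    "MY TEST END\n"
    "0.5056E+03 0.8687E-03 -0.1202E-02\n"
    "0.3776E+03 0.8687E-03 0.1975E-04\n"
    "STOP\n"
    "---some lines after this\n"
    "MY TEST MATRIX (ROWS)\n"
    "2E+04 2E+04 0.8687E-03\n"
    "2E+04 2E+04 0.8687E-03\n"
    "MY TEST END\n"
    "0.5056E+03 0.8687E-03 -0.1202E-02\n"
    "0.5056E+03 0.8687E-03 -0.1202E-02\n"
    "STOP\n"
    "---some lines after this\n"
)

first, second = extract(sample)
print(first)
print(second)
```

Keep in mind that this approach needs the whole file in memory as one string, so it suits small and medium files better than very large ones.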
Upvotes: 1