Extracting blocks from a text file

Question

I have a text file that contains blocks in the following format:

...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02  
 0.3776E+03  0.8687E-03  0.1975E-04  
STOP
---some lines after this
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03  
 2E+04  2E+04  0.8687E-03
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
STOP
---some lines after this
---this repeats in txt file----

There are many such blocks, and they appear at different places in the text file. I want to extract just the values that appear between MY TEST MATRIX (ROWS) and MY TEST END, and between MY TEST END and STOP, into two separate arrays; let's call them firstvalue[] and secondvalue[].

For me, one block is "MY TEST MATRIX - MY TEST END - STOP".

With simple code like that shown here, I can read one block of data from the text file. However, since the blocks repeat in my text file, I do not know how to capture the data from each block in the two arrays above.

    import os
    import sys
    from math import *
    firstValue = []
    secondValue = []
    checkFirst = False
    checkSecond = False
    filename="r3dmdtr2.txt"
    with open(filename, "r") as infile:

        for line in infile:
            if line.strip().startswith("MY TEST MATRIX (ROWS)"):
                checkFirst = True
            if line.strip().startswith("MY TEST END"):
                checkFirst = False
                checkSecond = True
            if line.strip().startswith("STOP"):
                checkSecond = False  

            if checkFirst:
                firstValue.append(line) 

            if checkSecond:
                secondValue.append(line)          

    print(firstValue)
    print (secondValue)

The above fragment reads one block of data perfectly. How can I parse all the repeating blocks in my text file and append each one as an individual array to firstValue[] (and likewise secondValue[])?

Something like:

firstvalue = [[values from first block], [values from second block], and so on...]
secondvalue = [[values from first block], [values from second block], and so on...]

dawg · Accepted Answer

Given:

$ cat file.txt
...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02  
 0.3776E+03  0.8687E-03  0.1975E-04  
STOP
---some lines after this
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03  
 2E+04  2E+04  0.8687E-03
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
STOP
---some lines after this
---this repeats in txt file----

In sed, perl or awk you have the concept of a range regex to do something along the lines of:

$ sed -nE '/^MY TEST MATRIX/,/^MY TEST END/p' file.txt
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
MY TEST END
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03  
 2E+04  2E+04  0.8687E-03
MY TEST END

You can replicate this functionality in Python with a FlipFlop class:

class FlipFlop:
    ''' Class to imitate the behavior of the /start/,/end/ flip-flop in awk '''
    def __init__(self, start_pattern, end_pattern):
        # Both arguments are compiled regexes (objects with a .search method)
        self.patterns = start_pattern, end_pattern
        self.state = False
    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        # Start and end match on the same line: emit it and close the range
        if all(ms):
            self.state = False
            return True
        was_on = self.state
        # ms[False] is the start match, ms[True] is the end match
        if ms[self.state]:
            self.state = not self.state
        return self.state or was_on

Then capture the blocks as you read the file line-by-line:

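import re

fn = 'file.txt'
di = {}
# Escape the parentheses so that "(ROWS)" is matched literally
blocks = [FlipFlop(re.compile(r'^MY TEST MATRIX \(ROWS\)'), re.compile(r'^MY TEST END')),
          FlipFlop(re.compile(r'^MY TEST END'), re.compile(r'^STOP'))]
# One pass over the file per flip-flop
for i, ff in enumerate(blocks):
    with open(fn) as f:
        di[i] = [line.strip() for line in f if ff(line)]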

Result:

>>> di
{0: ['MY TEST MATRIX (ROWS)', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     'MY TEST END', 
     'MY TEST MATRIX (ROWS)', 
     '2E+04  2E+04  0.8687E-03', 
     '2E+04  2E+04  0.8687E-03', 
     'MY TEST END'], 
 1: ['MY TEST END', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     '0.3776E+03  0.8687E-03  0.1975E-04', 
     'STOP', 
     'MY TEST END', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     'STOP']}
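
The flat lists in `di` still include the marker lines and run all blocks together, whereas the question asks for one array per block. A minimal sketch of a post-processing step, assuming the marker text never appears inside a data row; `group_blocks` and the shortened sample `di` literal are hypothetical names introduced here for illustration:

```python
def group_blocks(lines, start, end):
    """Split flat captured lines into one list of float rows per block."""
    groups, current = [], None
    for line in lines:
        if line.startswith(start):
            current = []                      # start marker opens a new block
        elif line.startswith(end):
            if current is not None:
                groups.append(current)        # end marker closes the block
            current = None
        elif current is not None and line.strip():
            # Data row: parse whitespace-separated numbers such as 0.5056E+03
            current.append([float(tok) for tok in line.split()])
    return groups


# Sample input shaped like the `di` result shown above (shortened).
di = {
    0: ['MY TEST MATRIX (ROWS)', '0.5056E+03  0.8687E-03 -0.1202E-02', 'MY TEST END',
        'MY TEST MATRIX (ROWS)', '2E+04  2E+04  0.8687E-03', 'MY TEST END'],
    1: ['MY TEST END', '0.5056E+03  0.8687E-03 -0.1202E-02', 'STOP',
        'MY TEST END', '0.3776E+03  0.8687E-03  0.1975E-04', 'STOP'],
}

firstvalue = group_blocks(di[0], 'MY TEST MATRIX', 'MY TEST END')
secondvalue = group_blocks(di[1], 'MY TEST END', 'STOP')
print(firstvalue)
print(secondvalue)
```

Each element of `firstvalue` and `secondvalue` is then one block, itself a list of rows of floats. If the rows should stay as strings, the `float` conversion can simply be dropped.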

The capture loop above reads the file once per flip-flop (twice here) so that the whole file never has to be held in memory; if speed is more important, you can read the file into memory once and iterate over that.
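
If the file fits in memory, the single-read variant might look like the sketch below. It repeats the `FlipFlop` class so that it runs on its own; `capture_once` and the inline `sample` text are hypothetical names used only for illustration, and in real use the lines would come from reading `file.txt` once, for example with `f.read().splitlines()`.

```python
import re


class FlipFlop:
    """Same range flip-flop as in the answer, repeated so the sketch is self-contained."""
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        if all(ms):
            self.state = False
            return True
        was_on = self.state
        if ms[self.state]:
            self.state = not self.state
        return self.state or was_on


def capture_once(lines, flipflops):
    """Feed every line to every flip-flop in a single pass over `lines`."""
    out = {i: [] for i in range(len(flipflops))}
    for line in lines:
        for i, ff in enumerate(flipflops):
            if ff(line):
                out[i].append(line.strip())
    return out


# Stand-in for the file contents; assumed sample, shortened from the question.
sample = """...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02
MY TEST END
 0.3776E+03  0.8687E-03  0.1975E-04
STOP
---some lines after this
"""

blocks = [FlipFlop(re.compile(r'^MY TEST MATRIX \(ROWS\)'), re.compile(r'^MY TEST END')),
          FlipFlop(re.compile(r'^MY TEST END'), re.compile(r'^STOP'))]
di = capture_once(sample.splitlines(), blocks)
print(di)
```

Because each flip-flop keeps its own state, create fresh `FlipFlop` instances for every file you process rather than reusing ones that may have been left mid-range.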
