Light_B

Reputation: 1800

Reading a file between headers in Python

I have a large text file in which blocks of values are separated by header lines starting with "#". If a header matches a given condition, I would like to read the file from that header up to the next "#" header and skip the rest of the file.

To test this, I'm trying to read the following text file, named test234.txt:

# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr

The code I wrote is:

file_t = open('test234.txt')
cond = True
while cond:
    for line_ in file_t:
        print(line_)
        if file_t.read(1) == "#":
            cond = False
file_t.close()

But the output I'm getting is:

# abcdefgh

fnrnf

rkfr

foiernfr

erfnr

something

jndjen kj

jkndjke

vcrvr

Instead, I would like only the lines between the first two "#" headers, which are:

1fnrnf
mrkfr
nfoiernfr
nerfnr

How can I do that? Thanks!

EDIT: The question "Reading in file block by block using specified delimiter in python" covers reading a file in groups separated by headers, but I don't want to read all of the groups. I only want to read the block under the header that meets a given condition, and to stop reading the file as soon as the next header marked by '#' is reached.

Upvotes: 4

Views: 1837

Answers (2)

hiro protagonist

Reputation: 46859

itertools.groupby can help:

from io import StringIO
from itertools import groupby

text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''


with StringIO(text) as file:
    lines = (line.strip() for line in file)  # strip the trailing '\n'
    # startswith() also copes with empty lines, where x[0] would raise IndexError
    for key, group in groupby(lines, key=lambda x: x.startswith('#')):

        if key:
            # found a line that starts with '#'
            print('found header: {}'.format(next(group)))
        else:
            # group now contains all lines that do not start with '#'
            print('\n'.join(group))

Note that all of this is lazy: at most the lines between two headers are held in memory at any time.

To read from your file instead of the test string, replace StringIO(text) with open('test234.txt', 'r'), so that the first line of the block reads: with open('test234.txt', 'r') as file:
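If only the block under one particular header is wanted, the same groupby idea can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: block_after_header and wanted_header are names I made up for illustration, and it assumes the header has to match exactly after stripping whitespace.

```python
from io import StringIO
from itertools import groupby


def block_after_header(lines, wanted_header):
    """Return the lines that follow wanted_header, up to the next '#' line.

    Hypothetical helper: assumes an exact match on the stripped header line.
    """
    stripped = (line.strip() for line in lines)
    take_next = False
    for is_header, group in groupby(stripped, key=lambda x: x.startswith('#')):
        if is_header:
            # consecutive headers land in one group; only the last one
            # is directly followed by a block of values
            take_next = list(group)[-1] == wanted_header
        elif take_next:
            # returning here means the rest of the input is never read
            return list(group)
    return []


sample = "# abcdefgh\n1fnrnf\nmrkfr\nnfoiernfr\nnerfnr\n# something\nnjndjen kj\nejkndjke\n#vcrvr"
print(block_after_header(StringIO(sample), '# abcdefgh'))
# ['1fnrnf', 'mrkfr', 'nfoiernfr', 'nerfnr']
```

Because the function returns as soon as the wanted group has been collected, everything after the next header is skipped. With a real file you would pass the open file object in place of StringIO(sample).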

the output for your test is:

found header: # abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
found header: # something
njndjen kj
ejkndjke
found header: #vcrvr

UPDATE: I misunderstood the question. Here is a fresh attempt:

from io import StringIO
from collections import deque
from itertools import takewhile

from_line = '# abcdefgh'
to_line = '# something'

# text is the sample string defined in the first snippet
with StringIO(text) as file:
    lines = (line.strip() for line in file)  # strip the trailing '\n'

    # fast-forward up to from_line
    deque(takewhile(lambda x: x != from_line, lines), maxlen=0)

    for line in takewhile(lambda x: x != to_line, lines):
        print(line)

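Here I use itertools.takewhile to get an iterator over the lines until a condition is met (in your case, until the to_line header is reached).

The deque part is just the consume pattern suggested in the itertools recipes: it fast-forwards the iterator to the point where the given condition no longer holds. Since takewhile also consumes the line that ends it, from_line itself is skipped and the second takewhile starts at the first line of the block.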
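If the closing header is not known in advance, the second takewhile can stop at any line starting with '#'. A rough sketch under that assumption (read_block is a made-up name, and the result is collected into a list so it can be used after the file is closed):

```python
from io import StringIO
from itertools import takewhile


def read_block(file, from_line):
    """Hypothetical helper: lines after from_line up to the next '#' header."""
    lines = (line.strip() for line in file)

    # fast-forward up to and including from_line
    for line in lines:
        if line == from_line:
            break

    # collect lines until any header shows up; if from_line was never
    # found, the iterator is already exhausted and the result is empty
    return list(takewhile(lambda x: not x.startswith('#'), lines))


sample = "# abcdefgh\n1fnrnf\nmrkfr\nnfoiernfr\nnerfnr\n# something\nnjndjen kj\nejkndjke\n#vcrvr"
with StringIO(sample) as file:
    for line in read_block(file, '# abcdefgh'):
        print(line)
```

For the sample text this prints the four lines under '# abcdefgh' and reads nothing past the '# something' header.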

Upvotes: 3

mujdecisy

Reputation: 11

Learn and use regex. It will help you with all kinds of document-processing tasks.

import re  # regex library

with open('test234.txt') as f:  # file stream
    lines = f.readlines()       # reads all lines into a list

p = re.compile('^#.*')          # pattern matching lines that start with '#'

for l in lines:
    if p.match(l) is None:      # keep only non-header lines
        print(l.rstrip('\n'))   # drop the newline (l[:-2] would also cut a real character)
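The loop above prints every non-header line in the file. To keep only the block under one header and stop at the next one, the same pattern can be combined with a flag and an early break. This is just a sketch: lines_under and wanted are illustrative names, and it assumes the header is compared as an exact string after stripping whitespace.

```python
import re
from io import StringIO

header = re.compile(r'^#')  # a header is any line starting with '#'


def lines_under(file, wanted):
    """Hypothetical helper: collect the lines between wanted and the next header."""
    found = False
    result = []
    for line in file:                 # reads line by line, not the whole file
        line = line.rstrip('\n')
        if header.match(line):
            if found:
                break                 # next header reached: stop reading
            found = (line.strip() == wanted)
        elif found:
            result.append(line)
    return result


sample = "# abcdefgh\n1fnrnf\nmrkfr\nnfoiernfr\nnerfnr\n# something\nnjndjen kj\nejkndjke\n#vcrvr"
print('\n'.join(lines_under(StringIO(sample), '# abcdefgh')))
```

With a real file, pass the object returned by open('test234.txt') instead of StringIO(sample); iterating over it avoids loading all lines with readlines().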

Upvotes: 1
