NosIreland

Reputation: 91

Python continue reading file

I have a text file that has info divided in blocks in the following format:

start1
loads of text
end1
start2
loads of text
end2

What I need to do is find the start of a block and then parse the text inside it until the block ends. My understanding (probably wrong) is that I need two for loops: the first looks for the start of the block and the second parses the contents of the block. I cannot figure out how to make the second loop start from the line where the first loop stopped. Whatever I do, it always starts from the beginning of the file. Here is a snippet of what I have.

for line in s:
    if "start1" in line:
        print("started")
        ...second for loop...
    elif "end1" in line:
        print("finished")

Upvotes: 2

Views: 3237

Answers (6)

tdelaney

Reputation: 77347

It's easy... you can continue using the same iterator. The big problem is that your start and end delimiters aren't unique. I don't know if that's just your cooked-up example or if there is more to it. Delimiters need to be predictable, and they can't also appear in the text that is being delimited.

Assuming that you don't care about the delimiter part yet... this will go through the file. Note that both loops need to share a common iterator to make this work:

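iter_s = iter(s)
for line in iter_s:
    if "start1" in line:
        print("started")
        # the inner loop continues from where the outer loop stopped
        for line in iter_s:
            if "end1" in line:
                print("finished")
                break  # hand control back to the outer loop
            else:
                print("got a line")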

UPDATE

My original code worked for files but not for lists, so I changed it to grab an iterator before entering the for loop. There was a question about why iter_s = iter(s) was needed to get this to work. In fact, it's not needed for all objects. Suppose s is a file object. File objects act as their own iterator, so no matter how many iterators you ask for, they are all really the same file object and each will grab the next line.

>>> f=open('deleteme.txt', 'w')
>>> iter_f = iter(f)
>>> id(iter_f) == id(f)
True
>>> type(f)
<class '_io.TextIOWrapper'>
>>> type(iter_f)
<class '_io.TextIOWrapper'>
>>> f.close()

Other sequences define their own iterators that work independently. So, for a list, each iterator will start from the top. In this case, each iterator is like a separate cursor in the list.

>>> l=[]
>>> iter_l = iter(l)
>>> id(iter_l) == id(l)
False
>>> type(l)
<class 'list'>
>>> type(iter_l)
<class 'list_iterator'>

When a for loop starts, it gets an iterator for its object and then runs through it. If its object is already an iterator, it just uses it. That's why grabbing an iterator first works.

To make sure your code works with both kinds of objects, grab an iterator first.
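As a minimal sketch of the shared-iterator idea, the helper below gathers the lines of each block into a list instead of printing. The name collect_blocks, the marker arguments and the sample data are assumptions made for illustration. It also assumes that the marker strings never occur inside ordinary block text, since it uses the same substring test as the code above.

```python
def collect_blocks(lines, start_marker="start", end_marker="end"):
    """Return a list of blocks; each block is the list of lines between markers."""
    it = iter(lines)  # one shared iterator for both loops
    blocks = []
    for line in it:
        if start_marker in line:
            block = []
            # The inner loop resumes where the outer loop stopped,
            # because both loops draw from the same iterator.
            for line in it:
                if end_marker in line:
                    break
                block.append(line.rstrip("\n"))
            blocks.append(block)
    return blocks


# Works for a list here; an open file object could be passed the same way.
sample = ["start1\n", "loads of text\n", "end1\n",
          "start2\n", "more text\n", "end2\n"]
blocks = collect_blocks(sample)
print(blocks)
```

Because the function calls iter() itself, the caller does not need to know whether the input is its own iterator.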

Upvotes: 2

Filippo Costa

Reputation: 498

EDIT: this was not what the OP was looking for. This is the correct solution:

# One of the most versatile built-in Python libraries for string manipulation.
import re

# Defined before the loop so that it exists when it is called.
def myparser(string):
    ...

text = "your text here"
lines = text.splitlines()

# Index of the current block's start line, or None while outside a block.
start = None

# enumerate() allows you to get both indexes and lines
for i, line in enumerate(lines):

    if start is None and re.search(r"start[1-9][0-9]*", line):
        start = i

    elif start is not None and re.search(r"end[1-9][0-9]*", line):
        # pass everything between the start and end lines to the parser
        myparser("\n".join(lines[start + 1:i]))
        start = None

See the documentation of the re module for more information.

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210842

I saw in your comment that you are going to use regexes to parse the contents of the blocks... so why not use regexes to find the blocks as well:

from __future__ import absolute_import

import re


def parse_blocks(txt, blk_begin_re=r'start\d*', blk_end_re=r'end\d*', re_flags=re.I | re.S):
    """
    parse text 'txt' into blocks, beginning with 'blk_begin_re' RegEx
        and ending with 'blk_end_re' RegEx

    re.S is needed so that '.' also matches the newlines inside a block

    returns a list of tuples (parsed_block_begin, parsed_block, parsed_block_end)
    """
    pattern = r'({0})(.*?)({1})'.format(blk_begin_re, blk_end_re)
    return re.findall(pattern, txt, re_flags)

# read file into 'data' variable
with open('text.txt', 'r') as f:
    data = f.read()

# list all parsed blocks
for blk_begin, blk, blk_end in parse_blocks(data, r'start[\d]*', r'end[\d]*', re.I | re.S):
    # print line separator
    print('=' * 60)
    print('started block: [{}]'.format(blk_begin))
    print(blk)
    print('ended block: [{}]'.format(blk_end))
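The pattern above pairs any start marker with the next end marker, whatever their numbers. If each startN should only close at the endN with the same number, a backreference is one way to express that. This is a sketch under assumptions: the sample text is made up, and the markers are assumed to sit alone on their own lines.

```python
import re

# Hypothetical sample text; the markers follow the question's startN/endN format.
text = "start1\nloads of text\nend1\nstart2\nmore text\nend2\n"

# (\d+) captures the block number; \1 requires the same number after "end".
# re.S lets "." match newlines so a block can span several lines;
# re.M lets "^" and "$" match at line boundaries.
pattern = re.compile(r"^start(\d+)\n(.*?)^end\1$", re.S | re.M)

# Map each block number (as a string) to the text inside that block.
blocks = {number: body for number, body in pattern.findall(text)}
print(blocks)
```

Anchoring the end marker with $ keeps end1 from matching the beginning of a line such as end10.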

Upvotes: 0

willnx

Reputation: 1283

Depending on what you want to do with the data, something like this might be useful.

def readit(filepath):
    with open(filepath) as thefile:
        data = []
        sentinel = 'end1'
        for line in thefile:
            if line.startswith('start'):
                # build the matching end marker from the block number,
                # e.g. 'start12' -> 'end12' (rstrip drops the newline)
                sentinel = 'end' + line.rstrip()[len('start'):]
            elif line.rstrip() == sentinel:  # again, the rstrip is to drop the newline char
                yield data
                data = []
            else:
                data.append(line)

This is a generator that yields the lines between one 'start' marker and its matching 'end' marker each time you call next() on it.

You'd use it like this (data.txt is a placeholder for the path to your file):

>>> generator = readit('data.txt')
>>> next(generator)
['loads of text\n']
>>> next(generator)
['more text\n']

Here's what my data file looked like:

start1
loads of text
end1
start2
more text
end2
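A generator like this is usually consumed with a for loop rather than repeated next() calls. The sketch below is a self-contained variant, not the code above: the name read_blocks is an assumption, it accepts any iterable of lines instead of a path, and io.StringIO stands in for a real file. It treats any line starting with "end" inside a block as the end marker, which is an assumption about the data.

```python
import io


def read_blocks(lines):
    """Yield the list of lines inside each start/end block."""
    data = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.rstrip()
        if stripped.startswith("start"):
            data = []  # a new block begins
        elif stripped.startswith("end") and data is not None:
            yield data
            data = None
        elif data is not None:
            data.append(line)


# io.StringIO behaves like an open text file for iteration purposes.
fake_file = io.StringIO("start1\nloads of text\nend1\nstart2\nmore text\nend2\n")
results = list(read_blocks(fake_file))
for block in results:
    print(block)
```

The for loop stops on its own once the input is exhausted, so there is no StopIteration to handle.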

Upvotes: 0

Cosinux

Reputation: 321

Is this helpful?

filename = "file to open"
with open(filename) as f:
    for line in f:
        line = line.strip()  # drop the trailing newline before comparing
        if line.startswith("start"):
            print("started")
        elif line.startswith("end"):
            print("finished")
        else:
            print("this is just an ordinary text")
            # Do whatever here

Upvotes: 0

Jacob H

Reputation: 876

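You can use a while loop for this:

# file is assumed to be an already-open file object
line = file.readline()
while line != '':
    if "start1" in line:
        print("started")
        line = file.readline()  # move past the start marker
        while "end1" not in line and line != '':
            print("Read a line.")
            line = file.readline()
        print("Finished")
    line = file.readline()  # advance, otherwise the outer loop never ends

This should give the expected results.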

Upvotes: 0
