user393267

Reputation:

(Python) Best way to parse a file to avoid performance issues

I have some concerns about the best way to handle a file that contains information I need to isolate.

As an example, imagine a log file whose data is divided into blocks, where each block contains a list of sub-blocks.

Example of the log file:

data
data
data
data 
   block 1 start
    -sub block 1 start
    --data x
    --data y
    -sub block 1 end
    -sub block 2 start
    --data x
    --data marked as good
    --data z
    -sub block 2 end
    block 1 end
    block 1 summary

    block 2 start
    -sub block 1 start
    .....
    -sub block 1 end
    ....
data
data
data

I am looking for an efficient way to parse the bigger file (several MB of text), isolate the blocks, and then check each block's sub-blocks for a specific line. If the line is in a sub-block, I save the start and end lines of the block the sub-block belongs to, along with the sub-block that contains the line (discarding the other sub-blocks that do not have the data), and repeat until I hit the end of the file.

Example of how the results should look:

block 1 start
-sub block 2 start
--data marked as good
-sub block 2 end
block 1 summary
.....

Right now I am using this approach: I open the file, then divide it into smaller subsets to work with, and I have three lists that gather the info.

The first list, called list_general, contains the result of parsing the whole log file, minus everything not related to the blocks I need to isolate. Basically, after this step I have only the blocks as in the example above, without the "data" lines. While I do this I check for the "good data" string: if I see that string at least once, it means there is data I need to process and save; otherwise I just end the function.

If there is data to process, I go line by line through list_general and start to isolate each block and its sub-blocks, starting from the first block (so from "block 1 start" to "block 1 summary", if you look at the example).

Once I hit the end of a block ("block 1 summary"), if there is data marked as good, I start to parse it, going through each sub-block to find which one has the good data.

I copy each sub-block line by line, like I did for the blocks (basically copying line by line from "sub block 1 start" to "sub block 1 end"), and check whether the good data is in that sub-block. If it is, I copy the list content to the final list; otherwise I delete the list and start with the next sub-block.
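
To make this concrete, here is a stripped-down sketch of the shape of my current code (not the real thing; list_general is assumed to already hold the filtered lines, and the marker strings are the ones from the example log above):

final_list = []
current_block = []

for raw in list_general:
    line = raw.strip()
    if not line:
        continue
    current_block.append(line)
    if line.endswith("summary"):                        # end of one block
        if any("marked as good" in l for l in current_block):
            kept = [current_block[0]]                   # "block N start"
            sub_block = []
            for block_line in current_block:
                if "sub block" in block_line and block_line.endswith("start"):
                    sub_block = [block_line]            # open a new sub-block
                elif "sub block" in block_line and block_line.endswith("end"):
                    sub_block.append(block_line)
                    if any("marked as good" in l for l in sub_block):
                        kept.extend(sub_block)          # keep only the good sub-block
                    sub_block = []
                elif sub_block:
                    sub_block.append(block_line)
            kept.append(line)                           # "block N summary"
            final_list.extend(kept)
        current_block = []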

I know that this mechanism of parsing each section is very cumbersome and expensive resource-wise, so I was wondering if there is a better way to do this. I am pretty new to Python, so I am not sure how a problem like this is usually approached. Hopefully someone here has had a similar issue and can suggest the best way to tackle it.

Upvotes: 4

Views: 805

Answers (2)

Lester Cheung

Reputation: 2040

For log files I'd throw away the lines I don't care about while parsing the file, stuffing anything useful into SQLite (check out the sqlite3 module). Then I'd do the reporting/processing once I'm done parsing the file.

SQLite can be configured to use disk or memory as storage, so you can choose according to your needs.
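
A rough sketch of what I have in mind (the file name, the table layout, and the "marked as good" test are placeholders for whatever you actually need to keep):

import sqlite3

# ":memory:" keeps the database in RAM; pass a file name instead for on-disk storage.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE goodlines (block TEXT, subblock TEXT, line TEXT)")

block = subblock = None
with open("big.log") as log:
    for raw in log:
        line = raw.strip()
        if line.endswith("start") and "sub block" not in line:
            block = line                               # "block N start"
        elif line.endswith("start"):
            subblock = line                            # "sub block N start"
        elif "marked as good" in line:
            db.execute("INSERT INTO goodlines VALUES (?, ?, ?)",
                       (block, subblock, line))
db.commit()

# Reporting/processing afterwards is plain SQL.
for row in db.execute("SELECT * FROM goodlines"):
    print(row)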

What I like about this approach is that it's flexible and I do not need to parse anything twice.

Added: Something similar to this?

class Parser:
    def __init__(self, logfile):
        self.log = open(logfile)
        self.logentry = []

    def __iter__(self):
        # Generator: collect lines until a block boundary, then yield the whole block.
        for line in self.log:
            self.logentry.append(line.rstrip("\n"))
            # Replace this test with whatever marks the end of a block in your format,
            # e.g. the "block N summary" line from your example.
            if line.strip().endswith("summary"):
                yield "\n".join(self.logentry)
                self.logentry = []
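
You would then drive it with something like this (the file name and the "marked as good" test are just placeholders based on your example):

wanted = []
for block in Parser("big.log"):
    if "data marked as good" in block:
        wanted.append(block)      # or insert it into the sqlite table here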

Upvotes: 1

Olaf Dietsche

Reputation: 74098

If you can identify block or sub block boundaries with just block ... start and block ... end, you can process each block as you read and store the result wherever you need it.
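
For example, something along these lines (handle_block, the marker tests, and the file name are only illustrative):

def handle_block(block_lines):
    # Process/store one complete block; here we only keep the "good" ones.
    if any("marked as good" in line for line in block_lines):
        print("\n".join(block_lines))

block = None
with open("big.log") as log:
    for raw in log:
        line = raw.strip()
        if line.endswith("start") and "sub block" not in line:
            block = [line]                              # "block N start"
        elif line.endswith("end") and "sub block" not in line:
            if block is not None:
                block.append(line)                      # "block N end"
                handle_block(block)
                block = None
        elif block is not None:
            block.append(line)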

Upvotes: 0
