user393267

Reputation:

(Python) Best way to parse a file to avoid performance issues

I have some concerns about the best way to handle a file that contains information I need to isolate.

As an example, imagine a log file whose data is divided into blocks, where each block contains a list of sub-blocks.

Example of the log file:

data
data
data
data 
   block 1 start
    -sub block 1 start
    --data x
    --data y
    -sub block 1 end
    -sub block 2 start
    --data x
    --data marked as good
    --data z
    -sub block 2 end
    block 1 end
    block 1 summary

    block 2 start
    -sub block 1 start
    .....
    -sub block 1 end
    ....
data
data
data

I am looking for an efficient way to parse the bigger file (several MB of text), isolate the blocks, and then check each block's sub-blocks for a specific line. If the line is in a sub-block, I save the start and end lines of the block the sub-block belongs to, along with the sub-block that contains the line (discarding the other sub-blocks that do not have the data), and repeat until I hit the end of the file.

Example of how the results should look:

block 1 start
-sub block 2 start
--data marked as good
-sub block 2 end
block 1 summary
.....

Right now I am using this approach: I open the file, then divide it into smaller subsets to work with, and I have three lists that gather the info.

The first list, called list_general, contains the result of parsing the whole log file, minus everything not related to the blocks I need to isolate. Basically, after this step I have only the blocks as in the example above, without the "data" lines. While I do this I check for the "good data" string: if I see that string at least once, it means there is data I need to process and save; otherwise I just end the function.

If there is data to process, I go line by line through list_general and start to isolate each block and its sub-blocks, starting from the first block (so from "block 1 start" to "block 1 summary", if you look at the example).

Once I hit the end of a block ("block 1 summary"), if there is data marked as good, I start to parse it, going through each sub-block to find which one has the good data.

I copy each sub-block line by line, like I did for the blocks (basically copying line by line from "sub block 1 start" to "sub block 1 end"), and check whether the good data is in that sub-block. If it is, I copy the list content to the final list; otherwise I delete the list and start with the next sub-block.
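
To make this concrete, here is a stripped-down sketch of the shape of my current code (not the real thing; list_general is assumed to already hold the filtered lines, and the marker strings are the ones from the example log above):

final_list = []
current_block = []

for raw in list_general:
    line = raw.strip()
    if not line:
        continue
    current_block.append(line)
    if line.endswith("summary"):                        # end of one block
        if any("marked as good" in l for l in current_block):
            kept = [current_block[0]]                   # "block N start"
            sub_block = []
            for block_line in current_block:
                if "sub block" in block_line and block_line.endswith("start"):
                    sub_block = [block_line]            # open a new sub-block
                elif "sub block" in block_line and block_line.endswith("end"):
                    sub_block.append(block_line)
                    if any("marked as good" in l for l in sub_block):
                        kept.extend(sub_block)          # keep only the good sub-block
                    sub_block = []
                elif sub_block:
                    sub_block.append(block_line)
            kept.append(line)                           # "block N summary"
            final_list.extend(kept)
        current_block = []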

I know that this mechanism of parsing each section is very cumbersome and expensive resource-wise, so I was wondering if there is a better way to do this. I am pretty new to Python, so I am not sure how a problem like this is usually approached. Hopefully someone here has had a similar issue and can suggest the best way to tackle it.

Upvotes: 4

Views: 805

Answers (2)

Lester Cheung

Reputation: 2040

For log files I'd throw away the lines I don't care about while parsing the file, stuffing anything useful into SQLite (check out the sqlite3 module). Then I'd do the reporting/processing once I'm done parsing the file.

SQLite can be configured to use disk or memory as storage, so you can choose according to your needs.
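
A rough sketch of what I have in mind (the file name, the table layout, and the "marked as good" test are placeholders for whatever you actually need to keep):

import sqlite3

# ":memory:" keeps the database in RAM; pass a file name instead for on-disk storage.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE goodlines (block TEXT, subblock TEXT, line TEXT)")

block = subblock = None
with open("big.log") as log:
    for raw in log:
        line = raw.strip()
        if line.endswith("start") and "sub block" not in line:
            block = line                               # "block N start"
        elif line.endswith("start"):
            subblock = line                            # "sub block N start"
        elif "marked as good" in line:
            db.execute("INSERT INTO goodlines VALUES (?, ?, ?)",
                       (block, subblock, line))
db.commit()

# Reporting/processing afterwards is plain SQL.
for row in db.execute("SELECT * FROM goodlines"):
    print(row)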

What I like about this approach is that it's flexible and I do not need to parse anything twice.

Added: Something similar to this?

class Parser:
    def __init__(self, logfile):
        self.log = open(logfile)
        self.logentry = []

    def __iter__(self):
        # Generator: collect lines until a block boundary, then yield the whole block.
        for line in self.log:
            self.logentry.append(line.rstrip("\n"))
            # Replace this test with whatever marks the end of a block in your format,
            # e.g. the "block N summary" line from your example.
            if line.strip().endswith("summary"):
                yield "\n".join(self.logentry)
                self.logentry = []
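
You would then drive it with something like this (the file name and the "marked as good" test are just placeholders based on your example):

wanted = []
for block in Parser("big.log"):
    if "data marked as good" in block:
        wanted.append(block)      # or insert it into the sqlite table here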

Upvotes: 1

Olaf Dietsche

Reputation: 74098

If you can identify block or sub block boundaries with just block ... start and block ... end, you can process each block as you read and store the result wherever you need it.
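
For example, something along these lines (handle_block, the marker tests, and the file name are only illustrative):

def handle_block(block_lines):
    # Process/store one complete block; here we only keep the "good" ones.
    if any("marked as good" in line for line in block_lines):
        print("\n".join(block_lines))

block = None
with open("big.log") as log:
    for raw in log:
        line = raw.strip()
        if line.endswith("start") and "sub block" not in line:
            block = [line]                              # "block N start"
        elif line.endswith("end") and "sub block" not in line:
            if block is not None:
                block.append(line)                      # "block N end"
                handle_block(block)
                block = None
        elif block is not None:
            block.append(line)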

Upvotes: 0
