Reputation:
I have some concerns about the best way to handle a file containing information that has to be isolated.
As an example, imagine a log file whose data is divided into blocks, and each block has a list of sub blocks.
Example of the log file:
data
data
data
data
block 1 start
-sub block 1 start
--data x
--data y
-sub block 1 end
-sub block 2 start
--data x
--data marked as good
--data z
-sub block 2 end
block 1 end
block 1 summary
block 2 start
-sub block 1 start
.....
-sub block 1 end
....
data
data
data
I am looking for an efficient way to parse the big file (several MB of text), isolate the blocks, and then check each block's sub blocks for a specific line. If the line is in a sub block, I will save the block start and end lines that the sub block belongs to, plus the sub block containing the line (but I will discard the other sub blocks that do not have the data), until I hit the end of the file.
Example of how the results should look:
block 1 start
-sub block 2 start
--data marked as good
-sub block 2 end
block 1 summary
.....
Right now I am using this approach: I open the file, then divide it into smaller subsets to work with; I have 3 lists that gather the info.
The first list, called list_general, will contain the results of parsing the whole log file, minus everything that is not related to the blocks I need to isolate. Basically, after this step I will have only the blocks as in the example above, minus the "data" lines. While I do this I check for the "good data" string: if I see that string at least once, it means there is data I need to process and save; otherwise I just end the function.
If there is data to process, I go line by line through list_general and start to isolate each block and its sub blocks, starting from the first block (so from "block 1 start" to "block 1 summary", if you look at the example).
Once I hit the end of a block ("block 1 summary"), if it contains the data marked as good I start to parse it, going through each sub block to find which one has the good data.
I copy each sub block line by line, like I did for the blocks (basically copying from "sub block 1 start" to "sub block 1 end"), and check whether the good data is in that sub block. If it is, I copy the list contents to the final list; otherwise I delete the list and start with the next sub block.
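Roughly, the logic looks like this (a simplified sketch; the real marker strings in my log are different):

GOOD = 'data marked as good'

def parse_log(path):
    # Step 1: keep only the lines that belong to blocks, and note whether
    # the "good" marker shows up at all.
    list_general = []
    in_block = False
    found_good = False
    with open(path) as f:
        for line in f:
            if line.startswith('block') and 'start' in line:
                in_block = True
            if in_block:
                list_general.append(line.rstrip('\n'))
                found_good = found_good or GOOD in line
            if line.startswith('block') and 'summary' in line:
                in_block = False
    if not found_good:
        return []

    # Step 2: isolate one block at a time, then keep only the sub blocks
    # that contain the marker.
    final, block = [], []
    for line in list_general:
        block.append(line)
        if line.startswith('block') and 'summary' in line:
            if any(GOOD in l for l in block):
                final.extend(keep_good_sub_blocks(block))
            block = []
    return final

def keep_good_sub_blocks(block):
    kept, sub, in_sub = [block[0]], [], False   # keep "block N start"
    for line in block[1:-1]:
        if 'sub block' in line and 'start' in line:
            in_sub, sub = True, []
        if in_sub:
            sub.append(line)
        if 'sub block' in line and 'end' in line:
            in_sub = False
            if any(GOOD in l for l in sub):
                kept.extend(sub)                # keep the matching sub block
    kept.append(block[-1])                      # keep "block N summary"
    return kept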
I know that this mechanism of parsing each section is very cumbersome and expensive resource-wise, so I was wondering if there is a "better" way to do this. I am pretty new to Python, so I am not sure how a similar issue should be approached. Hopefully someone here has had a similar issue and can suggest the best way to handle it.
Upvotes: 4
Views: 805
Reputation: 2040
For log files I'd throw away the lines I don't care about while parsing the file, stuffing anything useful into sqlite (check the sqlite3 module), then do the reporting/processing once I'm done parsing the file.
Sqlite can be configured to use disk or memory as storage, so you can choose according to your needs.
What I like about this approach is that it's flexible and I do not need to parse anything twice.
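A rough sketch of what I mean; the table layout, filename and marker strings here are only examples:

import sqlite3

# Parse once, keep only the interesting lines, store them in sqlite.
conn = sqlite3.connect(':memory:')        # or a file path for disk storage
conn.execute('CREATE TABLE log (block INTEGER, sub_block INTEGER, line TEXT)')

block = sub = 0
with open('big.log') as f:
    for line in f:
        if line.startswith('block') and 'start' in line:
            block += 1
            sub = 0
        elif 'sub block' in line and 'start' in line:
            sub += 1
        elif 'data marked as good' in line:
            conn.execute('INSERT INTO log VALUES (?, ?, ?)',
                         (block, sub, line.strip()))
conn.commit()

# Do the reporting afterwards, e.g. list where the good data lives.
for row in conn.execute('SELECT block, sub_block, line FROM log'):
    print(row)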
Added: Something similar to this?
class Parser:
    def __init__(self, logfile):
        self.log = open(logfile)
        self.logentry = []

    def next(self):
        # Generator: collects lines until a block ends, then yields the
        # whole block as one string.
        for line in self.log:
            self.logentry.append(line.rstrip('\n'))
            # Adjust this test to whatever marks the end of a block in
            # your format, e.g. the "summary" line.
            if line.startswith('block') and 'summary' in line:
                e = '\n'.join(self.logentry)
                self.logentry = []
                yield e
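You would use it something like this (the filename and the "good data" test are just placeholders):

p = Parser('big.log')
for entry in p.next():
    # entry is one complete block as a single string
    if 'data marked as good' in entry:
        print(entry)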
Upvotes: 1
Reputation: 74098
If you can identify block or sub block boundaries with just "block ... start" and "block ... end", you can process each block as you read and store the result wherever you need it.
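For example, something along these lines (the marker tests and the "good data" check are assumptions based on the sample above):

def blocks(path):
    # Yield one block (as a list of lines) at a time while reading the file.
    block = None
    with open(path) as f:
        for line in f:
            if line.startswith('block') and 'start' in line:
                block = []
            if block is not None:
                block.append(line.rstrip('\n'))
            if line.startswith('block') and 'summary' in line:
                yield block
                block = None

for block in blocks('big.log'):
    # Keep only the blocks that contain the interesting line.
    if any('data marked as good' in line for line in block):
        print('\n'.join(block))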
Upvotes: 0