Reputation: 166
I am trying to parse a large file, line by line, for relevant information. I may be receiving either an uncompressed or gzipped file (I may have to edit for zip file at a later stage).
I am using the following code, but I feel that, because I am not inside the with statement, I am not parsing the file line by line and am in fact loading the entire file into memory as file_content.
if ".gz" in FILE_LIST['INPUT_FILE']:
    with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
else:
    with open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
for line in file_content:
    # do stuff
Any suggestions for how I should handle this? I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.
Upvotes: 5
Views: 418
Reputation: 140297
readlines reads the file fully, so it's a no-go for big files.
Using two context blocks like you're doing and then using the input_file handle outside them doesn't work (operation on closed file).
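To illustrate the closed-file problem, here is a minimal sketch. The temporary file and its contents are made up for the demonstration and are not part of the original question.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first\nsecond\n")

with open(path) as input_file:
    pass  # the handle is closed as soon as this block exits

try:
    for line in input_file:  # iterating over a closed handle
        print(line)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

os.remove(path)
print(error_message)
```

The loop never yields a line: the first attempt to read from the closed handle raises ValueError.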
To get the best of both worlds, I would use a ternary conditional for the context block (which determines whether open or gzip.open must be used), then iterate on the lines.
open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open
with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
    for line in input_file:
        pass  # do stuff
Note that I pass the "rt" mode so that both functions return text rather than bytes: gzip.open defaults to binary, and in Python 3 a plain "r" still means binary for gzip.open.
Alternative: open_function can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']:
open_function = lambda f: gzip.open(f, "rt") if ".gz" in f else open(f)
Once defined, you can reuse it at will:
with open_function(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
        pass  # do stuff
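For completeness, here is a self-contained sketch of the same idea that can be run as is. The names open_text and count_lines and the sample files are hypothetical, and it assumes Python 3 and that gzipped files can be recognised by a ".gz" suffix (checked with endswith rather than in, so that a name like "data.gz.txt" is not misread).

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzipped file for reading text (hypothetical helper)."""
    # Assumption: gzipped files are recognised by their ".gz" suffix.
    opener = gzip.open if path.endswith(".gz") else open
    return opener(path, "rt")


def count_lines(path):
    """Iterate lazily so the whole file is never loaded at once."""
    count = 0
    with open_text(path) as input_file:
        for _ in input_file:
            count += 1
    return count


# Demonstration on two throwaway files with the same content.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "sample.txt")
gz_path = os.path.join(tmp_dir, "sample.txt.gz")
with open(plain_path, "w") as f:
    f.write("a\nb\nc\n")
with gzip.open(gz_path, "wt") as f:
    f.write("a\nb\nc\n")

plain_count = count_lines(plain_path)
gz_count = count_lines(gz_path)
print(plain_count, gz_count)
```

Both calls go through the same loop, so the per-line processing only has to be written once whichever format arrives.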
Upvotes: 5