AllynH

Reputation: 166

Parsing large, possibly compressed, files in Python

I am trying to parse a large file, line by line, for relevant information. I may receive either an uncompressed or a gzipped file (I may have to add support for zip files at a later stage).

I am using the following code, but I feel that, because I am not processing the lines inside the with statement, I am not parsing the file line by line and am in fact loading the entire file into memory as file_content.

if ".gz" in FILE_LIST['INPUT_FILE']:
    with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
else:
    with open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()

for line in file_content:
    # do stuff

Any suggestions for how I should handle this? I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.

Upvotes: 5

Views: 418

Answers (1)

Jean-François Fabre

Reputation: 140297

readlines() reads the whole file into a list in memory, so it's a no-go for big files.

Keeping the two context blocks as they are and then iterating on the input_file handle outside them wouldn't work either (I/O operation on closed file).
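A minimal sketch of that failure, assuming Python 3 and a throwaway temporary file (the file contents are made up for illustration):

```python
import os
import tempfile

# Create a small throwaway file to read back (illustrative content only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")
    path = tmp.name

with open(path) as input_file:
    first_line = input_file.readline()  # fine: the file is still open here

# The with block has exited, so input_file is now closed.
try:
    for line in input_file:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

os.remove(path)
print(error_message)
```

The loop has to live inside the with block, which is what the approach below does.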

To get the best of both worlds, I would use a conditional expression to pick the function for the context block (open or gzip.open), then iterate over the lines inside that block.

import gzip

open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open
with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
    for line in input_file:
        pass  # do stuff

Note that I have passed the "rt" mode so that both functions return text rather than bytes. In Python 3, gzip.open defaults to binary, and a plain "r" still means binary there.
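A quick way to check what gzip.open returns in each mode, assuming Python 3 (the temporary file and its contents are only for illustration):

```python
import gzip
import os
import tempfile

# Write a tiny gzip file to a temporary location (illustrative data only).
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wb") as out:
    out.write(b"alpha\nbeta\n")

# For gzip.open, mode "r" is the same as "rb": lines come back as bytes.
with gzip.open(path, "r") as f:
    binary_line = f.readline()

# Mode "rt" wraps the stream in a text layer: lines come back as str.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text_line = f.readline()

os.remove(path)
print(type(binary_line).__name__, type(text_line).__name__)
```

The built-in open accepts "rt" as well, so the same mode string can be passed to either function.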

Alternative: open_function can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']:

open_function = lambda f: gzip.open(f, "rt") if ".gz" in f else open(f)

Once defined, you can reuse it at will:

with open_function(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
        pass  # do stuff
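If the file name cannot be trusted, one possible variation is to look at the first two bytes of the file instead of its name, since every gzip stream starts with 0x1f 0x8b. This is a sketch under assumptions, not part of the code above: open_text and count_lines are hypothetical names, the encoding is assumed to be UTF-8, and the sample files are made up.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path, encoding="utf-8"):
    """Hypothetical helper: open path as text, handling gzip transparently."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    return opener(path, "rt", encoding=encoding)


def count_lines(path):
    """Stream the file line by line; it is never loaded in full."""
    total = 0
    with open_text(path) as input_file:
        for _line in input_file:
            total += 1
    return total


# Demo with two temporary files holding the same made-up content.
payload = b"one\ntwo\nthree\n"
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "sample.log")
gz_path = os.path.join(tmp_dir, "sample.log.gz")
with open(plain_path, "wb") as f:
    f.write(payload)
with gzip.open(gz_path, "wb") as f:
    f.write(payload)

plain_count = count_lines(plain_path)
gz_count = count_lines(gz_path)
shutil.rmtree(tmp_dir)
print(plain_count, gz_count)
```

Both calls go through the same loop, so the per-line processing only has to be written once regardless of whether the input is compressed.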

Upvotes: 5
