Moopsish

Reputation: 137

Reading large compressed files

This might be a simple question, but I can't find the answer or work out why this isn't working in this specific case.

I want to read large files that may or may not be compressed. I used contextlib to write a context manager function that handles both cases, and I read the files with a with statement in the main script.

My problem is that the script uses a lot of memory and then gets killed (tested with a compressed file). What am I doing wrong? Should I approach this differently?

def process_vcf(location):
    logging.info('Processing vcf')
    logging.debug(location)
    with read_compressed_or_not(location) as vcf:
        for line in vcf.readlines():
            if line.startswith('#'):
                logging.debug(line)

@contextmanager
def read_compressed_or_not(location):
    if location.endswith('.gz'):
        try: 
            file = gzip.open(location)
            yield file
        finally:
            file.close()
    else:
        try: 
            file = open(location, 'r')
            yield file
        finally:
            file.close()

Upvotes: 0

Views: 531

Answers (3)

jkr

Reputation: 19250

The file opening function is the main difference between reading a gzip file and a non-gzip file. So one can dynamically assign the opener and then read the file. Then there is no need for a custom context manager.

import gzip

open_fn = gzip.open if location.endswith(".gz") else open
with open_fn(location, mode="rt") as vcf:
    for line in vcf:
        ...
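As a slightly fuller sketch of the same idea, the snippet below wraps the opener selection in a small function. The name count_header_lines is made up for illustration and is not from the question; it assumes the file is text and that header lines start with '#', as in the question.

```python
import gzip


def count_header_lines(location):
    """Count lines starting with '#', reading one line at a time."""
    # Pick the opener from the file extension; both accept mode="rt".
    open_fn = gzip.open if location.endswith(".gz") else open
    count = 0
    with open_fn(location, mode="rt") as vcf:
        for line in vcf:  # lazy iteration, no full list of lines is built
            if line.startswith("#"):
                count += 1
    return count
```

Passing mode="rt" matters for the gzip case, because gzip.open defaults to binary mode and would otherwise yield bytes rather than str.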

Upvotes: 1

Tim Roberts

Reputation: 54668

The lowest-impact solution is simply to stop using readlines. readlines returns a list containing every line in the file, so the entire file ends up in memory. Iterating over the file object itself reads one line at a time, so the whole file never has to be in memory.

    with read_compressed_or_not(location) as vcf:
        for line in vcf:
            if line.startswith('#'):
                logging.debug(line)
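If you want to keep the custom context manager, here is a minimal sketch of how it might be tightened up. It assumes Python 3, where gzip.open defaults to binary mode, so line.startswith('#') would raise a TypeError on the bytes it returns; opening with "rt" avoids that. The helper header_lines is a made-up name for illustration.

```python
import gzip
from contextlib import contextmanager


@contextmanager
def read_compressed_or_not(location):
    # Open before the try block so `file` is always bound in finally.
    if location.endswith(".gz"):
        file = gzip.open(location, "rt")  # text mode; the default is binary
    else:
        file = open(location, "r")
    try:
        yield file
    finally:
        file.close()


def header_lines(location):
    """Yield lines that start with '#', one at a time (hypothetical helper)."""
    with read_compressed_or_not(location) as vcf:
        for line in vcf:  # iterate the file object, not vcf.readlines()
            if line.startswith("#"):
                yield line
```

Opening the file outside the try block also means a failed open raises its own error instead of a NameError from the finally clause.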

Upvotes: 3

Erik McKelvey

Reputation: 1627

Instead of using for line in vcf.readlines(), you can do:

line = vcf.readline()
while line:
    # Do stuff
    line = vcf.readline()

This loads only a single line into memory at a time.
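The same readline loop can also be written with the two-argument form of iter, which keeps calling readline until it returns the sentinel. This is only a sketch: read_line_by_line is a made-up name, and it assumes the file is open in text mode, where readline returns an empty string at end of file (in binary mode the sentinel would be b"").

```python
import io


def read_line_by_line(vcf):
    """Yield lines from an open text-mode file using readline()."""
    # iter(callable, sentinel) calls vcf.readline() repeatedly and stops
    # when it returns "", which a text-mode file does at end of file.
    for line in iter(vcf.readline, ""):
        yield line


# Usage with an in-memory file standing in for a real VCF.
sample = io.StringIO("#header\n1\t100\n")
lines = list(read_line_by_line(sample))
```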

Upvotes: 1
