Reputation: 137
This might be a simple question, but I can't seem to find the answer, or why it isn't working in this specific case.
I want to read large files, which can be compressed or not. I used contextlib to write a contextmanager function to handle this, and then I read the files in the main script using the with statement.
My problem is that the script uses a lot of memory and then gets killed (testing with a compressed file). What am I doing wrong? Should I approach this differently?
import gzip
import logging
from contextlib import contextmanager

def process_vcf(location):
    logging.info('Processing vcf')
    logging.debug(location)
    with read_compressed_or_not(location) as vcf:
        for line in vcf.readlines():
            if line.startswith('#'):
                logging.debug(line)

@contextmanager
def read_compressed_or_not(location):
    if location.endswith('.gz'):
        try:
            file = gzip.open(location)
            yield file
        finally:
            file.close()
    else:
        try:
            file = open(location, 'r')
            yield file
        finally:
            file.close()
Upvotes: 0
Views: 531
Reputation: 19250
The file-opening function is the main difference between reading a gzip file and a plain file, so you can assign the opener dynamically and then read the file. There is then no need for a custom context manager.
import gzip

open_fn = gzip.open if location.endswith(".gz") else open

with open_fn(location, mode="rt") as vcf:
    for line in vcf:
        ...
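Wrapped in a small helper (the name open_text is hypothetical), the same opener-selection idea becomes reusable across a script:

```python
import gzip

def open_text(location):
    # Hypothetical helper: pick the opener based on the extension, then
    # open in text mode ("rt") so lines come back as str either way.
    open_fn = gzip.open if location.endswith(".gz") else open
    return open_fn(location, mode="rt")
```

Because both gzip.open and the builtin open return context managers, the result can be used directly in a with statement, just like in the snippet above.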
Upvotes: 1
Reputation: 54668
The lowest-impact solution is simply to skip the readlines call. readlines returns a list containing every line in the file, so it does hold the entire file in memory. Iterating over the file object itself reads one line at a time, generator-style, so the whole file never has to be in memory at once.
with read_compressed_or_not(location) as vcf:
    for line in vcf:
        if line.startswith('#'):
            logging.debug(line)
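The difference is easy to see with an in-memory file (io.StringIO standing in for a real VCF): readlines materialises every line up front, while iterating the file object yields one line at a time and lets you stop early.

```python
import io

# readlines() builds a list of every line before you touch any of them.
buf = io.StringIO("#h1\n#h2\ndata1\ndata2\n")
all_lines = buf.readlines()          # whole "file" in memory as a list
assert len(all_lines) == 4

# Iterating the file object is lazy: one line per iteration.
buf = io.StringIO("#h1\n#h2\ndata1\ndata2\n")
headers = []
for line in buf:
    if not line.startswith('#'):
        break                        # remaining lines are never collected
    headers.append(line)
assert headers == ['#h1\n', '#h2\n']
```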
Upvotes: 3
Reputation: 1627
Instead of using for line in vcf.readlines(), you can do:

line = vcf.readline()
while line:
    # Do stuff
    line = vcf.readline()
This loads only a single line into memory at a time.
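A self-contained sketch of the pattern, with io.StringIO standing in for a real file; readline returns an empty string at end of file, which ends the loop:

```python
import io

vcf = io.StringIO("#header\nchr1\t123\n")

lines_seen = []
line = vcf.readline()
while line:                          # '' at EOF is falsy, so the loop stops
    lines_seen.append(line)          # "Do stuff" with the current line
    line = vcf.readline()

assert lines_seen == ["#header\n", "chr1\t123\n"]
```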
Upvotes: 1