Reputation: 355
I have many gzipped text files that I want to decompress and process on the fly, so that I save disk space and disk read time at the cost of decompressing while reading.
So I use the gzip module, together with tqdm to track progress.
But how can I find the size of the original uncompressed file, so that I can set the total number of (uncompressed) bytes to read and track progress against it? As far as I can tell from searching the web, this is hard to do with gzip for files larger than 4 gigabytes, which is my case.
Alternatively, I could track the number of compressed bytes read, with the total set to the size of the compressed file.
How can I achieve that?
Below is a code example, with comments that also reflect what I'm trying to achieve.
I am using Python 3.5.
import gzip
import tqdm
import os

size = os.path.getsize('filename.gz')
pbar = tqdm.tqdm(total=size, unit='b', unit_scale=True, unit_divisor=1024)
with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        bytes_uncompressed = len(line.encode('utf-8'))
        # but how can I get compressed bytes read count?
        # bytes_compressed = ...?
        # pbar.update(bytes_compressed)
Upvotes: 4
Views: 7284
Reputation: 1299
After trying to implement this myself, I found a simple solution (one that isn't made clear in the docs). You can access the underlying file object with gzippedfile.buffer.fileobj when opening as text, and gzippedfile.fileobj when opening as a binary file.
If you're iterating over the file, the cursor position reported by tell() on that underlying file object is the number of bytes read from disk.
See the TextIOWrapper doc for buffer usage and the gzip doc for fileobj.
In your case, you could do something like:
with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        # tqdm takes incremental updates, so pass (current position - last value)
        pbar.update(file.buffer.fileobj.tell() - pbar.n)
        # Do things
And here's an example implementation of @Mark Adler's suggestion, if you really need access to the binary file:
with open('filename.gz', 'rb') as f, gzip.open(f, 'rt') as file:
    for line in file:
        pbar.n = f.tell()  # Another way to set progress when we know total progress rather than increment
        pbar.update(0)  # Call refresh if needed
        # Do things
Upvotes: 5
Reputation: 1088
Here is what I did:
import gzip
import tqdm
import os


def _reader_generator(reader):
    # Yield successive 1 MiB chunks until the reader is exhausted
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)


def raw_newline_count_gzip(fname):
    # Count lines by decompressing the whole file once up front
    with gzip.open(fname, 'rb') as f:
        f_gen = _reader_generator(f.read)
        return sum(buf.count(b'\n') for buf in f_gen)


num = raw_newline_count_gzip('filename.gz')
with gzip.open('filename.gz', 'rt') as file:
    with tqdm.tqdm(total=num) as pbar:
        for line in file:
            bytes_uncompressed = len(line.encode('utf-8'))
            # do whatever you want
            pbar.update(1)
Hope this works on your file.
Upvotes: 0
Reputation: 112374
You have the answer in your question. Don't track the progress on uncompressed bytes. Track the progress on compressed bytes. They are roughly proportional to each other for a self-consistent compressed file, so you'll get the same effect. It is easy to find the size of the compressed file.
Upvotes: 2
Reputation: 9664
You should be able to open the underlying file for reading (in binary mode): f = open('filename.gz', 'rb'). Then open a gzip file on top of that: g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you can call f.tell() to ask for the position in the compressed file.
EDIT2: BTW, of course you can also use tell() on the GzipFile instance to see how far along (in bytes read) you are in the uncompressed file.
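A minimal sketch of this setup, using only the standard library so it runs without tqdm. The helper name read_with_positions, the chunk size, and the temporary demo file are assumptions made for illustration, not part of the original answer. Note that f.tell() advances in coarse steps, because the gzip module reads compressed data from the underlying file in buffered chunks.

```python
import gzip
import os
import tempfile


def read_with_positions(path, chunk_size=64 * 1024):
    """Yield (chunk, compressed_pos, uncompressed_pos) while reading a gzip file.

    Hypothetical helper: compressed_pos comes from the raw file object,
    uncompressed_pos from the GzipFile layered on top of it.
    """
    with open(path, 'rb') as f, gzip.GzipFile(fileobj=f) as g:
        while True:
            chunk = g.read(chunk_size)
            if not chunk:
                break
            yield chunk, f.tell(), g.tell()


# Build a small demo file (illustrative data only).
demo_data = b'some repetitive text\n' * 50000
fd, demo_path = tempfile.mkstemp(suffix='.gz')
os.close(fd)
with gzip.open(demo_path, 'wb') as out:
    out.write(demo_data)

total_compressed = os.path.getsize(demo_path)
chunks = []
positions = []
for chunk, compressed_pos, uncompressed_pos in read_with_positions(demo_path):
    chunks.append(chunk)
    positions.append((compressed_pos, uncompressed_pos))

os.remove(demo_path)
```

With a progress bar whose total is total_compressed, the compressed position from each step is what you would feed into it; the uncompressed position is available too, but without a known total it can only be shown as a running count.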
EDIT: Now I see that this is only a partial answer to your problem. You'd also need the total. There, I am afraid you are a bit out of luck, especially for files over 4GB, as you've noted. gzip keeps the uncompressed size in the last four bytes, so you could jump there, read them, and jump back (GzipFile does not seem to expose this information itself). But since it's only four bytes, the largest number it can store is 4GB; anything bigger gets truncated to the lower four bytes of the value. In that case, I am afraid you won't know until you get to the end.
Anyway, the hint above gives you the current position in both compressed and uncompressed terms; I hope that allows you to at least somewhat achieve what you've set out to do.
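For completeness, here is a rough sketch of reading those last four bytes. The function name gzip_isize and the demo file are assumptions for illustration. The value is stored little-endian and equals the uncompressed size modulo 2**32, so it is only trustworthy for a file with a single gzip member whose original size is below 4GB; for anything larger, or for files made of several concatenated members, it does not give the real total.

```python
import gzip
import os
import struct
import tempfile


def gzip_isize(path):
    """Return the size field stored in the last four bytes of a gzip file.

    Hypothetical helper: the result is the uncompressed size modulo 2**32
    of the final gzip member, not necessarily the true total.
    """
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack('<I', f.read(4))
    return isize


# Demo on a small single-member file (illustrative data only).
payload = b'x' * 123456
fd, isize_demo_path = tempfile.mkstemp(suffix='.gz')
os.close(fd)
with gzip.open(isize_demo_path, 'wb') as out:
    out.write(payload)

reported = gzip_isize(isize_demo_path)
os.remove(isize_demo_path)
```

For the multi-gigabyte files in the question, this number cannot be used as a progress total, which is why tracking the compressed position against the compressed file size is the more practical route.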
Upvotes: 6