konstunn

Reputation: 355

Reading lines from a gzipped text file in Python and getting the number of compressed bytes read

I have many gzipped text files that I want to decompress and process on the fly, so that I save disk space and disk-read time at the cost of decompression time.

So I use the gzip module, along with tqdm to track progress.

But how can I find out the size of the original uncompressed file, so that I can set the total (uncompressed) byte count for the progress bar? From what I have found searching the web, this is hard to do with gzip for files larger than 4 gigabytes, which is my case.

Alternatively, I could track the number of compressed bytes read, with the total set to the size of the compressed file.

How can I achieve that?

Here is a code example, with comments reflecting what I'm trying to achieve.

I am using Python 3.5.

import gzip
import tqdm
import os

size = os.path.getsize('filename.gz')
pbar = tqdm.tqdm(total=size, unit='b', unit_scale=True, unit_divisor=1024)

with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        bytes_uncompressed = len(line.encode('utf-8'))
        # but how can I get compressed bytes read count?
        # bytes_compressed = ...?

        # pbar.update(bytes_compressed)

Upvotes: 4

Views: 7284

Answers (4)

Dash

Reputation: 1299

After trying to implement this myself, I found the simple solution (that's not made clear in the docs). You can access the underlying file object with gzippedfile.buffer.fileobj when opening as text, and gzippedfile.fileobj when opening as a binary file.

If you're iterating over the file, the position of the cursor using tell() will be the number of bytes read from the disk.

See the TextIOWrapper documentation for buffer and the gzip documentation for fileobj.

In your case, you could do something like:

with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        # tqdm updates incrementally, so pass (current position - last value)
        pbar.update(file.buffer.fileobj.tell() - pbar.n)
        # Do things

And here's an example implementation of @Mark Adler's suggestion, if you really need access to the binary file

with open('filename.gz', 'rb') as f, gzip.open(f, 'rt') as file:
    for line in file:
        pbar.n = f.tell()  # Another way to set progress when we know total progress rather than increment
        pbar.update(0)   # Call refresh if needed
        # Do things

Upvotes: 5

Diya Li

Reputation: 1088

Here is what I did:

import gzip
import tqdm
import os


def _reader_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def raw_newline_count_gzip(fname):
    # Decompress the whole file once just to count its lines
    with gzip.open(fname, 'rb') as f:
        f_gen = _reader_generator(f.read)
        return sum(buf.count(b'\n') for buf in f_gen)

num = raw_newline_count_gzip('filename.gz')

with gzip.open('filename.gz', 'rt') as file:
    with tqdm.tqdm(total=num) as pbar:
        for line in file:
            bytes_uncompressed = len(line.encode('utf-8'))
            # do whatever you want

            pbar.update(1)

Hope this works on your file.

Upvotes: 0

Mark Adler

Reputation: 112374

You have the answer in your question. Don't track the progress on uncompressed bytes. Track the progress on compressed bytes. They are roughly proportional to each other for a self-consistent compressed file, so you'll get the same effect. It is easy to find the size of the compressed file.

Upvotes: 2

Ondrej K.

Reputation: 9664

You should be able to open the underlying file in binary mode, f = open('filename.gz', 'rb'), and then open the gzip file on top of that, g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you can call f.tell() to get the position in the compressed file.
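A minimal sketch of that layering, in case it helps. The helper name iter_lines_with_compressed_pos is made up for illustration and is not part of the gzip module. Note that gzip reads the underlying file in buffered chunks, so f.tell() advances in steps and runs slightly ahead of the line actually being processed.

```python
import gzip


def iter_lines_with_compressed_pos(path):
    """Yield (line, compressed_pos) pairs for a gzip file.

    Hypothetical helper: compressed_pos is f.tell() on the raw file,
    i.e. how many compressed bytes have been read from disk so far.
    """
    # Open the raw file ourselves so we keep a handle to ask for its position.
    with open(path, 'rb') as f, gzip.GzipFile(fileobj=f) as g:
        for line in g:  # lines are bytes here; decode them if text is needed
            yield line, f.tell()
```

With tqdm, the total would be os.path.getsize(path) and each step would call pbar.update(pos - pbar.n).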

EDIT2: By the way, you can of course also call tell() on the GzipFile instance to see how far along (in bytes read) you are in the uncompressed data.

EDIT: Now I see that this is only a partial answer to your problem. You also need the total. There, I am afraid, you are a bit out of luck, especially for files over 4 GB, as you've noted. gzip stores the uncompressed size in the last four bytes of the file, so you could seek there, read them, and seek back (GzipFile does not seem to expose this information itself). But since the field is only four bytes, the largest value it can hold is just under 4 GB; anything larger is truncated to the lower 4 bytes of the value (the size modulo 2^32). In that case, I am afraid you won't know the size until you reach the end.
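For reference, reading that trailer field could look roughly like the sketch below. The function name gzip_isize is hypothetical. It assumes a regular, seekable file with a single gzip member; for concatenated members the trailer only describes the last one, and for inputs of 4 GB or more the value is the size modulo 2^32, so it cannot be trusted as a progress-bar total.

```python
import os
import struct


def gzip_isize(path):
    """Return the 4-byte size field from the end of a gzip file.

    Hypothetical helper: the value is the uncompressed size modulo 2**32
    of the last gzip member, stored as a little-endian unsigned integer.
    """
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)  # the size field is the last four bytes
        return struct.unpack('<I', f.read(4))[0]
```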

Anyway, the hint above gives you the current position in both the compressed and the uncompressed data; I hope that lets you at least partly achieve what you set out to do.

Upvotes: 6
