Jprog

Reputation: 79

How to determine the Content-Length of a gzipped file in Python?

I have a large compressed file and I want to know the size of its contents without decompressing it. I tried this:

import gzip
import os

with gzip.open(data_file) as f:
    f.seek(0, os.SEEK_END)
    size = f.tell()

but I get this error:

ValueError: Seek from end not supported 

How can I do that?

Thx.

Upvotes: 4

Views: 4903

Answers (3)

Dan Lenski

Reputation: 79762

Unfortunately, the Python 2.x gzip module doesn't appear to support any way of determining uncompressed file size.

However, gzip does store the uncompressed file size as a little-endian 32-bit unsigned integer at the very end of the file: http://www.abeel.be/content/determine-uncompressed-size-gzip-file

Unfortunately, this only works for files smaller than 4 GB, because the gzip format stores the size in only a 32-bit integer; see the manual.

import os
import struct

with open(data_file,"rb") as f:
    f.seek(-4, os.SEEK_END)
    size, = struct.unpack("<I", f.read(4))
    print(size)
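
As a rough illustration of the 4 GB limit: the trailer field holds the uncompressed length modulo 2**32, so anything larger wraps around. The helper name `isize_for` is made up for this sketch; it only models the arithmetic and does not read a file.

```python
def isize_for(uncompressed_len):
    """Value the 32-bit trailer field would hold for a given uncompressed length."""
    return uncompressed_len % 2**32

small = isize_for(1000)          # below 4 GiB: reported as-is
wrapped = isize_for(5 * 2**30)   # a 5 GiB file would report 1 GiB

print(small, wrapped)
```

So for files that may exceed 4 GB, the trailer alone cannot be trusted.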

Upvotes: 2

Jprog

Reputation: 79

To summarize: I need to open huge compressed files (> 4 GB), so Dan's technique won't work, and I want the length of the file as a number of lines, so Mark Adler's technique is not appropriate.

Eventually I found a solution for uncompressed files (not the most optimized, but it works!) that transposes easily to compressed files:

import gzip

def count_lines(data_file):
    size = 0
    # Iterating over the file decompresses it one line at a time
    with gzip.open(data_file) as f:
        for line in f:
            size += 1
    return size

Thank you all, people in this forum are very effective!

Upvotes: -2

Mark Adler

Reputation: 112374

It is not possible in principle to definitively determine the size of the uncompressed data in a gzip file without decompressing it. You do not need to have the space to store the uncompressed data -- you can discard it as you go along. But you have to decompress it all.

If you control the source of the gzip file and can ensure that a) there are no concatenated members in the gzip file, b) the uncompressed data is less than 4 GB in length, and c) there is no extraneous junk at the end of the gzip file, then and only then you can read the last four bytes of the gzip file to get a little-endian integer that has the length of the uncompressed data.

See this answer for more details.
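
A small sketch of why condition a) matters, using in-memory data rather than a real file. The helper `trailer_size` is a name made up for this example; it just reads the last four bytes as described above.

```python
import gzip
import struct

def trailer_size(blob):
    """Return the last four bytes of `blob` as a little-endian unsigned int."""
    (size,) = struct.unpack("<I", blob[-4:])
    return size

first = gzip.compress(b"a" * 1000)
second = gzip.compress(b"b" * 10)

# Single member: the trailer matches the real length.
single_reported = trailer_size(first)

# Two members concatenated still form a valid gzip file,
# but the trailer only describes the last member.
both = first + second
concat_reported = trailer_size(both)
concat_actual = len(gzip.decompress(both))

print(single_reported, concat_reported, concat_actual)  # 1000 10 1010
```

With concatenated members the trailer reports 10 while the real uncompressed length is 1010, so the shortcut is only safe when you know how the file was produced.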

Here is Python code to read a gzip file and print the uncompressed length, without having to store or save the uncompressed data. It limits the memory usage to small buffers. This requires Python 3.3 or greater:

#!/usr/local/bin/python3.4
import sys
import zlib
import warnings
f = open(sys.argv[1], "rb")
total = 0
buf = f.read(1024)
while True:             # loop through concatenated gzip streams
    z = zlib.decompressobj(15+16)
    while True:         # loop through one gzip stream
        while True:     # go through all output from one input buffer
            total += len(z.decompress(buf, 4096))
            buf = z.unconsumed_tail
            if buf == b"":
                break
        if z.eof:
            break       # end of a gzip stream found
        buf = f.read(1024)
        if buf == b"":
            warnings.warn("incomplete gzip stream")
            break
    buf = z.unused_data
    z = None
    if buf == b"":
        buf = f.read(1024)
        if buf == b"":
            break
print(total)

Upvotes: 2
