Eric Sauer

Reputation: 870

Determine how many files are in a zip

I am trying to read a zip file (in Python 2.7.2) by reading each of its bytes. I am able to get through the local file headers and the data. However, I am stuck when trying to read the central file header.

This diagram helps a lot: http://en.wikipedia.org/wiki/File:ZIP-64_Internal_Layout.svg

I don't know how to find out how many items there are in the archive, so I can't tell when to switch from parsing local file headers to parsing the central file headers. Is there another way to know where that switch happens?

This is what I have right now:

import sys

def main(debug=0,arg_file=''):
    if debug==2:
        print "- Opening %s" % arg_file
    with open(arg_file, 'rb') as archive:  # binary mode, so no byte is translated or cut short
        if debug==2:
            print "- Reading %s" % arg_file

        bytes = archive.read()
        if debug==2:
            print "-------------Binary-------------"
            print bytes

        #Read file headers
        end = 0
        while end != len(bytes):
            print end
            end = process_sub_file(debug,end,bytes)

def process_sub_file(debug,startbytes, bytes): 
    header = bytes[startbytes + 0] + bytes[startbytes + 1] + bytes[startbytes + 2] + bytes[startbytes + 3]
    version = bytes[startbytes + 4] + bytes[startbytes + 5]
    flags = bytes[startbytes + 6] + bytes[startbytes + 7]
    comp_method = bytes[startbytes + 8] + bytes[startbytes + 9]
    mod_time = bytes[startbytes + 10] + bytes[startbytes + 11]
    mod_date = bytes[startbytes + 12] + bytes[startbytes + 13]
    crc = bytes[startbytes + 14] + bytes[startbytes + 15] + bytes[startbytes + 16] + bytes[startbytes + 17]
    # ZIP stores multi-byte integers little-endian, so weight each byte by its position.
    comp_size_bytes = bytes[startbytes + 18] + bytes[startbytes + 19] + bytes[startbytes + 20] + bytes[startbytes + 21]
    comp_size = ord(comp_size_bytes[0]) + (ord(comp_size_bytes[1]) << 8) + (ord(comp_size_bytes[2]) << 16) + (ord(comp_size_bytes[3]) << 24)
    uncomp_size_bytes = bytes[startbytes + 22] + bytes[startbytes + 23] + bytes[startbytes + 24] + bytes[startbytes + 25]
    uncomp_size = ord(uncomp_size_bytes[0]) + (ord(uncomp_size_bytes[1]) << 8) + (ord(uncomp_size_bytes[2]) << 16) + (ord(uncomp_size_bytes[3]) << 24)
    name_len_bytes = bytes[startbytes + 26] + bytes[startbytes + 27]
    name_len = ord(name_len_bytes[0]) + (ord(name_len_bytes[1]) << 8)
    extra_len_bytes = bytes[startbytes + 28] + bytes[startbytes + 29]
    extra_len = ord(extra_len_bytes[0]) + (ord(extra_len_bytes[1]) << 8)
    file_name = ""
    for i in range(name_len):
        file_name = file_name + bytes[startbytes + 30 + i]
    extra_field = "" 
    for i in range(extra_len):
        extra_field = extra_field + bytes[startbytes + 30 + name_len + i]
    data = ""
    for i in range(comp_size):
        data = data + bytes[startbytes + 30 + name_len + extra_len + i]
    if debug>=1:
        print "-------------Header-------------"
        print "Header Signature: %s" % header
        print "Version: %s" % version
        print "Flags: %s" % flags
        print "Compression Method: %s" % comp_method
        print "Modification Time: %s" % (ord(mod_time[0]) + (ord(mod_time[1]) << 8))
        print "Modification Date: %s" % (ord(mod_date[0]) + (ord(mod_date[1]) << 8))
        print "CRC-32: %s" % crc
        print "Compressed Size: %s" % comp_size
        print "Uncompressed Size: %s" % uncomp_size
        print "File Name Length: %s" % name_len
        print "Extra Field Length: %s" % extra_len
        print "File Name: %s" % file_name
        print "Extra Field: %s" % extra_field
        print "Data:\n%s" % data
    return startbytes + 30 + name_len + extra_len + comp_size

Upvotes: 1

Views: 1301

Answers (1)

Nathan Moinvaziri

Reputation: 5638

You want to search through the file backwards for the "End of Central Directory" block. It contains the total number of entries in the central directory.

Search for "End of central directory record:" in: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
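As a rough illustration of the backwards search, here is a minimal sketch. It assumes the whole archive is already in memory as a byte string and that the archive comment does not itself contain the signature bytes; `read_eocd` is a name made up for this example. It uses only `struct` and should run on Python 2.7 as well as Python 3.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 as it appears on disk (little-endian)
EOCD_MIN_SIZE = 22              # fixed part of the record, without the comment


def read_eocd(data):
    """Return (total_entries, cd_size, cd_offset) from the End of Central Directory record."""
    # The record sits at the very end of the file, followed only by an
    # optional comment, so look for the last occurrence of its signature.
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_MIN_SIZE > len(data):
        raise ValueError("End of Central Directory record not found")

    # Layout: signature, disk number, disk holding the central directory,
    # entries on this disk, total entries, central directory size,
    # central directory offset, comment length.
    (signature, disk_no, cd_disk, entries_on_disk, total_entries,
     cd_size, cd_offset, comment_len) = struct.unpack(
        "<4s4H2IH", data[pos:pos + EOCD_MIN_SIZE])
    return total_entries, cd_size, cd_offset
```

Calling `read_eocd(open(path, 'rb').read())` would give the entry count directly, without walking the local file headers first.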

If the total number of entries in the central directory is 0xffff, the archive may be Zip64. In that case, look for the "Zip64 End of Central Directory Locator", which is located directly before the "End of Central Directory" block and gives the offset of the "Zip64 End of Central Directory" block. That Zip64 block contains the actual number of entries in the central directory for the zip file.
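The Zip64 fallback could look roughly like this. It is a sketch under a few assumptions: a single-disk archive held fully in memory, field offsets taken from the APPNOTE layout, and a hypothetical function name `count_entries`. It goes through the 20-byte Zip64 locator that precedes the "End of Central Directory" record to reach the Zip64 record.

```python
import struct


def count_entries(data):
    """Return the number of central directory entries, following Zip64 if needed."""
    eocd_pos = data.rfind(b"PK\x05\x06")
    if eocd_pos < 0 or eocd_pos + 22 > len(data):
        raise ValueError("End of Central Directory record not found")

    # Bytes 10..11 of the record hold the 16-bit total entry count.
    total_entries = struct.unpack("<H", data[eocd_pos + 10:eocd_pos + 12])[0]
    if total_entries != 0xFFFF:
        return total_entries

    # 0xFFFF can mean "see the Zip64 record". The 20-byte Zip64 locator,
    # if present, sits immediately before the End of Central Directory record.
    loc_pos = eocd_pos - 20
    if loc_pos < 0 or data[loc_pos:loc_pos + 4] != b"PK\x06\x07":
        return total_entries  # no locator: the archive really has 65535 entries

    # Bytes 8..15 of the locator: offset of the Zip64 End of Central Directory record.
    z64_pos = struct.unpack("<Q", data[loc_pos + 8:loc_pos + 16])[0]
    if data[z64_pos:z64_pos + 4] != b"PK\x06\x06":
        raise ValueError("Zip64 End of Central Directory record not found")

    # Bytes 32..39 of the Zip64 record: 64-bit total entry count.
    return struct.unpack("<Q", data[z64_pos + 32:z64_pos + 40])[0]
```

For archives with fewer than 65535 entries, the function never touches the Zip64 structures.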

The "End of Central Directory" block also contains the offset of the start of the central directory. Go to that offset to begin iterating through all the file header blocks in the central directory; it also tells you where to stop parsing local file headers.
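Put together, walking the central directory might look something like the sketch below. It assumes a non-Zip64, single-disk archive already read into memory, and `list_central_directory` is a hypothetical name used only here. Each central file header has a 46-byte fixed part followed by the file name, extra field and file comment.

```python
import struct

CENTRAL_HEADER_SIZE = 46  # fixed part of a central file header


def list_central_directory(data):
    """Return the raw file names stored in the central directory, in order."""
    eocd_pos = data.rfind(b"PK\x05\x06")
    if eocd_pos < 0 or eocd_pos + 22 > len(data):
        raise ValueError("End of Central Directory record not found")

    # Bytes 10..19 of the record: total entries, directory size, directory offset.
    total_entries, cd_size, cd_offset = struct.unpack(
        "<HII", data[eocd_pos + 10:eocd_pos + 20])

    names = []
    pos = cd_offset
    for _ in range(total_entries):
        fields = struct.unpack("<4s6H3I5H2I", data[pos:pos + CENTRAL_HEADER_SIZE])
        if fields[0] != b"PK\x01\x02":
            raise ValueError("bad central file header at offset %d" % pos)
        # fields[10..12]: lengths of the file name, extra field and file comment.
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        names.append(data[pos + CENTRAL_HEADER_SIZE:pos + CENTRAL_HEADER_SIZE + name_len])
        # Skip the variable-length tail to reach the next header.
        pos += CENTRAL_HEADER_SIZE + name_len + extra_len + comment_len
    return names
```

With this approach the local file headers do not need to be scanned at all; the last field of each central header (`fields[16]`) is the offset of the matching local file header if the data is needed.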

Upvotes: 1
