Reputation: 12490
I am dealing with a somewhat large binary file (717M). This binary file contains a set (unknown number!) of complete zip files.
I would like to extract all of those zip files (no need to explicitly decompress them). I can find the offset (start point) of each chunk thanks to the magic number ('PK'), but I cannot find a way to compute the length of each chunk (i.e. to carve those zip files out of the large binary file).
Reading some documentation (http://forensicswiki.org/wiki/ZIP) gives me the impression that a zip file is easy to parse, since it contains the compressed size of each compressed file.
Is there a way for me to do that in C or Python without reinventing the wheel?
Upvotes: 1
Views: 3142
Reputation: 112239
A zip entry is permitted to not contain the compressed size in the local header. There is a flag bit to have a descriptor with the compressed size, uncompressed size, and CRC follow the compressed data.
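To make the flag bit concrete, here is a minimal Python sketch that tests bit 3 of the general purpose bit flag in a local file header. The names `uses_data_descriptor` and `make_sample_zip` are made up for illustration, and the sample archive is built in memory with the standard `zipfile` module only so that there is something to inspect.

```python
import io
import struct
import zipfile

DATA_DESCRIPTOR_FLAG = 0x0008  # bit 3 of the general purpose bit flag


def uses_data_descriptor(buf, offset=0):
    """Return True if the local header at `offset` defers CRC and sizes
    to a data descriptor that follows the compressed data."""
    if buf[offset:offset + 4] != b"PK\x03\x04":
        raise ValueError("no local file header at this offset")
    # The 2-byte flag field sits 6 bytes into the header,
    # after the 4-byte signature and the 2-byte version field.
    (flags,) = struct.unpack_from("<H", buf, offset + 6)
    return bool(flags & DATA_DESCRIPTOR_FLAG)


def make_sample_zip():
    """Build a tiny one-entry archive in memory (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", b"hello world")
    return buf.getvalue()


if __name__ == "__main__":
    print(uses_data_descriptor(make_sample_zip()))
```

When the function returns True, the size fields in the local header cannot be trusted for carving, which is why walking local headers alone is fragile.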
It would be more reliable to search for end-of-central-directory headers, use that to find the central directories, and use that to find the local headers and entries. This will require attention to detail, very carefully reading the PKWare appnote that describes the zip format. You will need to handle the Zip64 format as well, which has additional headers and fields.
It is possible for a zip entry to be stored, i.e. copied verbatim into that location in the zip file, and it is possible for that entry to itself be a zip file. So make sure that you handle the case of embedded zip files, extracting only the outermost zip files.
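The approach above could be sketched in Python roughly as follows. This is not a complete carver: it assumes no Zip64 records, single-disk archives, and a central directory that sits immediately before its end-of-central-directory record. The function name `find_archives` is hypothetical. It locates each end-of-central-directory signature, derives where the archive starts and ends, and then drops any span that lies inside another one, which is one way to keep only the outermost zip files.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
# signature, disk no., disk with CD, entries on disk, total entries,
# central directory size, central directory offset, comment length
EOCD_FMT = "<4s4H2LH"
EOCD_SIZE = struct.calcsize(EOCD_FMT)  # 22 bytes


def find_archives(blob):
    """Return (start, end) byte spans of the outermost zip archives in `blob`."""
    spans = []
    pos = blob.find(EOCD_SIG)
    while pos != -1:
        if pos + EOCD_SIZE <= len(blob):
            (_sig, _disk, _cd_disk, _n_disk, total,
             cd_size, cd_offset, comment_len) = struct.unpack_from(
                EOCD_FMT, blob, pos)
            # Assumption: the central directory ends where the EOCD begins.
            cd_start = pos - cd_size
            # cd_offset is relative to the first byte of the archive.
            start = cd_start - cd_offset
            end = pos + EOCD_SIZE + comment_len
            plausible = (
                start >= 0
                and end <= len(blob)
                and (total == 0
                     or blob[cd_start:cd_start + 4] == b"PK\x01\x02")
            )
            if plausible:
                spans.append((start, end))
        pos = blob.find(EOCD_SIG, pos + 1)
    # Drop spans nested inside another span (stored, embedded zip files).
    return [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]
```

Each returned span can be written out with `blob[start:end]`. For a file of several hundred megabytes, an `mmap.mmap` object should work in place of a `bytes` object here, since it also supports `find` and slicing.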
Upvotes: 2
Reputation: 61
There are some standard ways to handle zip files, in Python for example, but as far as I know (not that I'm an expert) you first need to supply the actual file somehow. I suggest looking at the zip file format specification.
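You should be able to find the other information you need from its position relative to the magic number. Note that the magic number is the 4-byte local file header signature ('PK' followed by the bytes 0x03 0x04), not the CRC-32. Counting from the start of the signature, the CRC-32 is at offset 14, the compressed size at offset 18, the file name length at offset 26, and the file name itself starts at offset 30. The local file header is laid out as follows:
local file header signature 4 bytes
version needed to extract 2 bytes
general purpose bit flag 2 bytes
compression method 2 bytes
last mod file time 2 bytes
last mod file date 2 bytes
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
file name length 2 bytes
extra field length 2 bytes
file name (variable size)
extra field (variable size)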
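As a rough illustration of reading those fields in Python, the fixed part of a local file header is 30 bytes (4-byte signature, five 2-byte fields, three 4-byte fields, then the two 2-byte lengths), all little-endian, and can be unpacked with the standard `struct` module. The function name `parse_local_header` and the dictionary keys are made up for this sketch, and it assumes a non-Zip64 entry.

```python
import struct

# signature, version, flags, method, mod time, mod date,
# crc-32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER_FMT = "<4s5H3L2H"
LOCAL_HEADER_SIZE = struct.calcsize(LOCAL_HEADER_FMT)  # 30 bytes


def parse_local_header(buf, offset=0):
    """Decode the local file header that starts at `offset` in `buf`."""
    (sig, _version, flags, method, _mtime, _mdate,
     crc, csize, usize, name_len, extra_len) = struct.unpack_from(
        LOCAL_HEADER_FMT, buf, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header at this offset")
    name_start = offset + LOCAL_HEADER_SIZE
    return {
        "flags": flags,
        "method": method,
        "crc32": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "name": bytes(buf[name_start:name_start + name_len]),
        # The entry's data begins after the file name and the extra field.
        "data_start": name_start + name_len + extra_len,
    }
```

If bit 3 of `flags` is set, the CRC and size fields read here may be zero, so `data_start + compressed_size` is only a reliable end position when that bit is clear.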
Hope that helps a little bit at least :)
Upvotes: 1