Phil Hannent

Reputation: 12317

What is a good way to test a file to see if it's a zip file?

I am looking at a new file format specification, and the specification says the file can be either XML-based or a zip file containing an XML file and other files.

The file extension is the same in both cases. What ways could I test the file to decide if it needs decompressing or just reading?

Upvotes: 19

Views: 22254

Answers (10)

Lőrinc Bethlenfalvy

Reputation: 57

God, this thread is a mess. ZIP deliberately doesn't rely on a header because it was always intended to support multipurpose files. The most popular examples of this are self-extracting archives, but a lot of other formats that are defined in terms of ZIP such as JAR also define a magic number.

Your best bet is to read through the entire file for 0x06054b50 (legacy EoCD record signature). In a ZIP64 archive it is immediately preceded by the ZIP64 End of Central Directory locator: 0x07064b50 followed by 16 bytes of arbitrary data.

At some point Wikipedia stated that the EoCD record is at the very end of the file. This is technically true, but the EoCD record ends with a variable length text field preceded by its length, and I can't find a specific encoding in the spec, so I don't think it's possible to deduce the exact offset of the signatures from the end of the file.
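
For what it's worth, the comment length is stored in a 2-byte field, so the EoCD record can only start within the last 22 + 65535 bytes of the file. Here is a minimal sketch of a backward scan built on that, assuming the record really does sit at the very end of the file (no trailing junk) and ignoring ZIP64 and multi-disk details. `find_eocd` is a made-up name for illustration, not something from the spec.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # 0x06054b50 as stored on disk (little-endian)
EOCD_MIN_SIZE = 22               # size of the fixed part of the EoCD record
MAX_COMMENT = 0xFFFF             # the comment length is a 2-byte field


def find_eocd(data: bytes) -> int:
    """Return the offset of a plausible EoCD record in data, or -1."""
    # The record cannot start earlier than 22 + 65535 bytes from the end.
    lowest = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIGNATURE, lowest)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            # The last 2 bytes of the fixed part hold the comment length.
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            if pos + EOCD_MIN_SIZE + comment_len == len(data):
                return pos
        # The signature bytes were part of a comment or file data: keep looking.
        pos = data.rfind(EOCD_SIGNATURE, lowest, pos)
    return -1
```

Since the check is relative to the end of the data, you can pass just the tail of a large file (the last 65557 bytes) instead of the whole thing; the returned offset is then relative to that tail.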

If this is unacceptably slow, you're probably better off verifying that the file starts like valid XML. This is a pretty good heuristic because

  • An XML file will never be a valid ZIP because the headers I mentioned above include forbidden characters
  • <!DOCTYP is not a recognized magic, and it's safe to say it'll never become one
  • Binary file formats usually either use a magic number or consist of blocks that start with a signature, just like ZIP. Although a ZIP file doesn't necessarily start with any particular sequence, as a binary file it's safe to say that it will start with some specific sequence and not user data that could accidentally be valid XML.
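
A rough sketch of the "starts like valid XML" check, assuming you have already read the first few dozen bytes of the file. `looks_like_xml` is a hypothetical helper, and this is only a heuristic, not an XML validator.

```python
import codecs

# Byte order marks that may precede the first character of an XML document.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def looks_like_xml(head: bytes) -> bool:
    """Heuristic: do the first bytes of a file look like the start of XML?"""
    for bom in _BOMS:
        if head.startswith(bom):
            head = head[len(bom):]
            break
    # UTF-16 text interleaves NUL bytes; drop them for this rough check only.
    head = head.replace(b"\x00", b"")
    # Whitespace may come first when there is no XML declaration.
    return head.lstrip(b" \t\r\n").startswith(b"<")
```

Anything that passes can be handed to a real XML parser; anything that fails is a candidate for the ZIP path.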

Upvotes: 0

RvdK

Reputation: 19790

File magic numbers

To clarify, it starts with 50 4b 03 04.

See http://www.pkware.com/documents/casestudies/APPNOTE.TXT (From Simon P Stevens)

Upvotes: 1

Simon P Stevens

Reputation: 27499

The zip file format is defined by PKWARE. You can find their file specification here.

Near the top you will find the header specification:

A. Local file header:

    local file header signature     4 bytes  (0x04034b50)
    version needed to extract       2 bytes
    general purpose bit flag        2 bytes
    compression method              2 bytes
    last mod file time              2 bytes
    last mod file date              2 bytes
    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes
    file name length                2 bytes
    extra field length              2 bytes

    file name (variable size)
    extra field (variable size)

From this you can see that the first 4 bytes of the header should be the file signature which should be the hex value 0x04034b50. Byte order in the file is the other way round - PKWARE specify that "All values are stored in little-endian byte order unless otherwise specified.", so if you use a hex editor to view the file you will see 50 4b 03 04 as the first 4 bytes.

You can use this to check if your file is a zip file. If you open the file in notepad, you will notice that the first two bytes (50 and 4b) are the ASCII characters PK.
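
As an illustration, here is a small sketch that unpacks the fixed 30-byte part of the local file header listed above and checks the signature. The function name and the returned dictionary are my own choices; only the field layout comes from the specification.

```python
import struct

# Fixed part of the local file header, little-endian, in the order listed
# above: one 4-byte signature, five 2-byte fields, three 4-byte fields,
# then the two 2-byte lengths. Total: 30 bytes.
LOCAL_HEADER = struct.Struct("<I5H3I2H")
LOCAL_SIGNATURE = 0x04034B50


def read_local_header(head: bytes):
    """Return a few header fields, or None if head is not a local file header."""
    if len(head) < LOCAL_HEADER.size:
        return None
    (signature, _version, _flags, method, _time, _date,
     _crc, _csize, _usize, name_len, _extra_len) = LOCAL_HEADER.unpack_from(head)
    if signature != LOCAL_SIGNATURE:
        return None
    # The file name follows the fixed part directly.
    name = head[LOCAL_HEADER.size:LOCAL_HEADER.size + name_len]
    return {"compression_method": method, "file_name": name}
```

Reading the first few hundred bytes of the file is normally enough for this; if the file name is longer than what you read, the returned name is simply truncated.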

Upvotes: 32

Thomas Matthews

Reputation: 57688

You could check the file to see if it contains a valid XML header. If it doesn't, try decompressing it.

See the XML specification.

Upvotes: 1

Kamran Khan

Reputation: 9986

Not a good solution, but just thinking out loud... how about:

    try
    {
        LoadXmlFile(theFile); // throws an exception if it is not an XML file
    }
    catch (Exception ex)
    {
        LoadZipFile(theFile);
    }
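
The same try-then-fall-back idea, sketched in Python with the standard library only. `load_document` and its return shape are invented for this example; a real loader would do something more useful with the parsed content.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def load_document(data: bytes):
    """Return ("xml", root_element) or ("zip", list_of_member_names)."""
    try:
        return "xml", ET.fromstring(data)
    except ET.ParseError:
        pass  # not well-formed XML, so fall back to treating it as ZIP
    # Raises zipfile.BadZipFile if the data is not a ZIP archive either.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return "zip", archive.namelist()
```

Catching only `ET.ParseError` rather than every exception keeps unrelated failures (a missing file, say) from being mistaken for "must be a zip".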

Upvotes: 1

solsTiCe

Reputation: 9

It depends on what you are using, but your zip library might have a function that tests whether a file is a zip file, something like is_zip, test_file_zip or whatever.

Or create your own function using the magic number given above.
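
In Python, for instance, the standard library has exactly such a function: `zipfile.is_zipfile`. A minimal sketch, where `needs_decompressing` is just a hypothetical wrapper name:

```python
import zipfile


def needs_decompressing(path) -> bool:
    """True if the file at path looks like a ZIP archive to the zipfile module."""
    # is_zipfile decides based on the file's magic number, so it does not
    # care about the file extension.
    return zipfile.is_zipfile(path)
```

If this returns False, the file can be read directly as XML.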

Upvotes: 0

ccheneson

Reputation: 49410

You can use the file utility to see if it's a text file (XML) or an executable (zip). Scroll down to see an example.

Upvotes: 1

Dominic Rodger

Reputation: 99751

You could try unzipping it - an XML file is exceedingly unlikely to be a valid zip file. Or you could check the magic numbers, as others have said.

Upvotes: 0

Yacoby

Reputation: 55445

Check the first few bytes of the file for the magic number. Zip files begin with PK (50 4B). As XML files cannot start with these characters and still be valid, you can be fairly sure as to the file type.

Upvotes: 1

Amber

Reputation: 526593

You could look at the magic number of the file. The ones for ZIP archives are listed on the ZIP format Wikipedia page: PK\003\004 or PK\005\006.
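
A short sketch of that check, assuming the archive starts at byte 0 of the file (so it will not catch self-extracting archives). `starts_with_zip_magic` is a made-up name.

```python
# PK\003\004 starts a local file header; PK\005\006 starts the end of
# central directory record, which is the first thing in an empty archive.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06")


def starts_with_zip_magic(path) -> bool:
    """True if the first four bytes of the file are a ZIP magic number."""
    with open(path, "rb") as f:
        return f.read(4) in ZIP_MAGICS
```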

Upvotes: 12
