mike

Reputation: 1005

Why do I get 2 different binary files when I 'zip' 2 identical directories?

This is on a Mac, if it matters. zip is version 3.0 and unzip is version 6.0 (presumably the versions shipped with the OS).

If I do the following:

Start with a generic 'pptx' file, unzip it into a directory, clean up the XML, then zip it up:

unzip V1.pptx -d dir
cd dir
find . -name "*.xml" -type f -exec xmllint --output '{}' --format '{}' \;
zip -0 ../V1Orig.pptx -r *

I now have a new zip file, V1Orig.pptx. I then repeat the same steps on it from the parent directory:

cd ..
unzip V1Orig.pptx -d copy
cd copy
find . -name "*.xml" -type f -exec xmllint --output '{}' --format '{}' \;
zip -0 ../V1Copy.pptx -r *

If I now 'diff' the orig and copy directories, they are the same:

Common subdirectories: orig/_rels and copy/_rels
Common subdirectories: orig/docProps and copy/docProps
Common subdirectories: orig/ppt and copy/ppt

But if I diff the two pptx files, or compare their md5 checksums, they differ:

diff V1Orig.pptx V1Copy.pptx
Binary files V1Orig.pptx and V1Copy.pptx differ

ls -rtla orig
total 8
drwxr-xr-x  11 fultonm  wheel   352 10 Jan 16:49 ppt
drwxr-xr-x   5 fultonm  wheel   160 10 Jan 16:49 docProps
drwxr-xr-x   3 fultonm  wheel    96 10 Jan 16:49 _rels
drwxr-xr-x   6 fultonm  wheel   192 14 Jan 10:40 .
-rw-r--r--   1 fultonm  wheel  3212 14 Jan 10:42 [Content_Types].xml
drwxr-xr-x   8 fultonm  wheel   256 14 Jan 10:57 ..

ls -rtla copy
total 8
drwxr-xr-x   5 fultonm  wheel   160 14 Jan 10:42 docProps
drwxr-xr-x   3 fultonm  wheel    96 14 Jan 10:42 _rels
drwxr-xr-x   6 fultonm  wheel   192 14 Jan 10:42 .
drwxr-xr-x  11 fultonm  wheel   352 14 Jan 10:42 ppt
-rw-r--r--   1 fultonm  wheel  3212 14 Jan 10:42 [Content_Types].xml
drwxr-xr-x   8 fultonm  wheel   256 14 Jan 10:57 ..

Upvotes: 0

Views: 44

Answers (2)

Mark Adler

Reputation: 112394

You can get them to be the same by making the timestamps of all of the files and directories the same, and by using the -X option so that extra file attribute information is not saved.

So for each zip command, use -rX, and in the copy directory do:

find . -exec touch -r ../dir/{} {} \;

before the zip.
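As a variation on the above, here is a minimal sketch that sets every timestamp to one fixed value instead of copying it from the other directory, so it also works when there is only one tree. The function names normalize_mtimes and zip_reproducibly and the date 2000-01-01 are arbitrary choices of mine, not anything zip requires. It assumes a zip that supports -X, -0 and -@ (zip 3.0 does), and I have not verified that the resulting archives are byte-identical on every platform.

```shell
# Hypothetical helper: give every file and directory under "$1" the same
# fixed modification time (2000-01-01 00:00 local time, an arbitrary choice).
normalize_mtimes() {
    find "$1" -exec touch -t 200001010000 {} +
}

# Hypothetical helper: zip the files under directory "$1" into archive "$2"
# (give "$2" as an absolute path, because the function changes directory).
# The file list is sorted so the entry order does not depend on the
# filesystem. -X drops extra attributes, -0 stores without compression as
# in the question, and -@ reads the names to add from stdin.
zip_reproducibly() {
    normalize_mtimes "$1" || return 1
    rm -f "$2"    # otherwise zip would update an existing archive
    ( cd "$1" && find . -type f | LC_ALL=C sort | zip -X -0 -@ "$2" )
}
```

Usage would be along the lines of zip_reproducibly dir "$PWD/V1Orig.pptx" followed by zip_reproducibly copy "$PWD/V1Copy.pptx". Only regular files are listed, so no directory entries end up in the archive.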

Why it should matter that the zip files be identical, I have no idea. What matters is that they both decompress to the same thing.

Upvotes: 1

mike

Reputation: 1005

I believe the problem is that the time stamps are being recorded. These of course will be different because the xmllint process changes the times.

If I run unzip -l, I see that the order of the files in the zip file is stable (it appears to be sorted by name, not by time), but the date and time stamps are recorded, and of course those differ.

The 'fix' is likely to ensure that the time stamps are not updated by any of the unzip/xmllint steps, so that the files still have their original time stamps when they are zipped up again.
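One way to do that, as a rough sketch: remember each file's modification time before rewriting it and restore it afterwards, using touch -r. The helper name keep_mtime is made up for illustration, and the sketch assumes mktemp is available.

```shell
# Hypothetical helper: run a command that rewrites FILE, then put FILE's
# original modification time back.
# Usage: keep_mtime FILE COMMAND [ARGS...]
keep_mtime() {
    file=$1; shift
    stamp=$(mktemp) || return 1
    touch -r "$file" "$stamp"    # remember the original mtime
    "$@"; status=$?              # run the rewriting command
    touch -r "$stamp" "$file"    # restore the original mtime
    rm -f "$stamp"
    return "$status"
}
```

With that in place, the xmllint step could be run per file as keep_mtime "$f" xmllint --output "$f" --format "$f", for example inside a find ... | while IFS= read -r f loop. This only covers the xmllint step; the time stamps that unzip restores are whatever was stored in the archive.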

Better answers appreciated!

Upvotes: 0
