Reputation: 1218
I am trying to ensure that 2 archives with the same files inside produce the same MD5 checksum.
For example, file1.txt and file2.txt have identical content; the only difference between them is creation time. As expected, they produce the same MD5:
>>> import hashlib
>>> # hash the raw text files (not archives)
>>> hashlib.md5(open("file1.txt","rb").read()).hexdigest()
'c99e47de6046f141693b9aecdbdd2dc2'
>>> hashlib.md5(open("file2.txt","rb").read()).hexdigest()
'c99e47de6046f141693b9aecdbdd2dc2'
However, when I create tarfile (or zipfile) archives for both identical files, I get completely different MD5s. Note I am using tarfile for file 1 and 2 in the exact same fashion.
>>> import tarfile, hashlib
>>> # file 1
>>> a1 = tarfile.open('archive1.tar.gz','w:gz')
>>> a1.add("file1.txt")
>>> a1.close()
>>> hashlib.md5(open("archive1.tar.gz","rb").read()).hexdigest()
'0865abb94f6fd92df990963c75519b2e'
>>> # file 2
>>> a2 = tarfile.open('archive2.tar.gz','w:gz')
>>> a2.add("file2.txt")
>>> a2.close()
>>> hashlib.md5(open("archive2.tar.gz","rb").read()).hexdigest()
'cee53e271a1f457dfd5b5401d8311fcc'
Any ideas why this is occurring? I am guessing it has something to do with the header data in the archive that is causing this. Perhaps the archives maintain the different creation times of file1 and file2, thus the different checksums.
Upvotes: 3
Views: 1829
Reputation: 31
Try using zipfile.
The key point is to pass writestr a ZipInfo object rather than a str filename.
If the argument is not a ZipInfo, zipfile builds one itself and sets its date_time to the current time.
That timestamp is the only varying value written to the zip file headers, and it is what makes the archive's MD5 change.
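A minimal sketch of that idea, assuming a single in-memory member. The helper name build_zip, the member name "file.txt" and the fixed date 1980-01-01 are illustrative choices, not anything from the question:

```python
import hashlib
import io
import zipfile


def build_zip(payload, arcname="file.txt", date_time=(1980, 1, 1, 0, 0, 0)):
    """Return the bytes of a zip archive holding one member with a fixed timestamp.

    build_zip is a hypothetical helper; arcname and date_time are assumed values.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Passing a ZipInfo (not a str) stops zipfile from stamping the current time.
        info = zipfile.ZipInfo(filename=arcname, date_time=date_time)
        # A hand-built ZipInfo defaults to ZIP_STORED, so ask for deflate explicitly.
        info.compress_type = zipfile.ZIP_DEFLATED
        zf.writestr(info, payload)
    return buf.getvalue()


first = build_zip(b"same content\n")
second = build_zip(b"same content\n")
print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())
```

Both archives use the same member name here; if the names inside the archives differ, the checksums will still differ.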
Upvotes: 3
Reputation: 1411
Whilst the payload of the two archives may be identical, the underlying structure of the archives is different, and compression only adds to those differences.
Zip and Tar are both archiving formats, and they can both be combined with compression; more often than not, they are. The combinations of differing compression algorithms and fundamentally different underlying format structure will result in different MD5s.
--
In this case, the last modification time and names of the underlying files are different, even though the contents of the files are the same; this results in a different MD5.
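For tar.gz, one way to get matching checksums is to reset that metadata before writing. The following is only a sketch under a few assumptions: both files are stored under the same member name, the gzip header is written without a timestamp or file name, and the helper names build_tar_gz and _normalize are made up for illustration. The demo creates its own temporary files with different modification times.

```python
import gzip
import hashlib
import io
import os
import tarfile
import tempfile


def _normalize(info):
    """Reset per-file metadata that varies between otherwise identical files."""
    info.mtime = 0
    info.mode = 0o644
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def build_tar_gz(path, arcname):
    """Return .tar.gz bytes that depend only on the file content and arcname."""
    buf = io.BytesIO()
    # filename="" and mtime=0 keep the gzip header free of a name and a timestamp.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            tar.add(path, arcname=arcname, filter=_normalize)
    return buf.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, mtime in (("file1.txt", 1000000000), ("file2.txt", 1500000000)):
        p = os.path.join(tmp, name)
        with open(p, "wb") as fh:
            fh.write(b"same content\n")
        os.utime(p, (mtime, mtime))  # give the two files different timestamps
        paths.append(p)
    # Same arcname for both, since differing member names also change the bytes.
    digests = [hashlib.md5(build_tar_gz(p, "file.txt")).hexdigest() for p in paths]

print(digests[0] == digests[1])
```

The resulting bytes can still vary between Python versions or zlib builds, so this is only meant for comparing archives produced in the same environment.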
Upvotes: 1