Reputation: 304
As explained in this article https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49, the MD5 of two .tar.gz files that compress the exact same set of files can differ. One reason is that gzip includes a timestamp in the header of the compressed file.
The article proposes three solutions, and I would ideally like to use the first one:
We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;
And this solution works well:
tar -c ./bin |gzip -n >one.tar.gz
tar -c ./bin |gzip -n >two.tar.gz
md5sum one.tar.gz two.tar.gz
However, I have no idea what a good way to do this in Python would be. Is there a way to do it with tarfile (https://docs.python.org/2/library/tarfile.html)?
Upvotes: 6
Views: 3540
Reputation: 112404
Sure, you can eliminate dates and other non-file information in the tar and gzip headers, and use the same version of the same compressor with the same settings, all in order to get exactly the same archive bytes.
However, doing all that leads me to think that you are solving the wrong problem, and that you will run into issues if someone changes the version of the compressor under you: signatures from before and after the version change will no longer match.
I would recommend instead that you generate your signatures using the concatenation of the uncompressed file contents. Then your signature will be naturally independent of all of the things you are currently having to go to some lengths to zero out, and will also be independent of changes in the compression code. Then all you will need to do is to take some care to preserve the order of the files in the archive.
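A minimal sketch of that idea, assuming the files live under a single directory tree. The function name content_digest and the choice of SHA-256 are illustrative assumptions, not part of the original suggestion. Note that only the file contents are hashed, so renaming a file without changing the sorted order would not change the digest.

```python
import hashlib
import os


def content_digest(root: str) -> str:
    """Hash the concatenated contents of every file under root, in sorted order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place makes os.walk descend in a fixed order
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                # Read in chunks so large files do not have to fit in memory
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

The digest depends only on file bytes and traversal order, so timestamps, ownership and the compressor version never enter into it.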
Upvotes: 2
Reputation: 5568
I needed to archive many files in one tar file (not just one), and the above answers didn't work for me. Instead, I used the Linux tar command with Python's subprocess module:
import subprocess

def make_tarfile_linux(folder_path, filename):
    """
    Make a reproducible tarfile that has an identical checksum each time.
    However, this method does not filter out unwanted files like Python can...
    """
    tarfile_to_create_path_and_filename = f"/home/user/{filename}"
    # Build the argument list directly so paths containing spaces survive intact
    command_list = [
        "tar", "--sort=name", "--owner=root:0", "--group=root:0",
        "--mtime=UTC 1970-01-01",
        "-cjf", tarfile_to_create_path_and_filename, folder_path,
    ]
    # check=True raises CalledProcessError if tar exits with a non-zero status
    subprocess.run(command_list, check=True)
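For comparison, here is a rough sketch of how the file filtering mentioned in the docstring could look with Python's tarfile instead of the tar command. The names reproducible_filter and make_tarfile_python, and the choice to skip .pyc files, are assumptions made for illustration only.

```python
import tarfile


def reproducible_filter(info: tarfile.TarInfo):
    # Returning None tells TarFile.add to leave this member out of the archive
    if info.name.endswith(".pyc"):
        return None
    # Fix the metadata that would otherwise vary between runs or machines
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = "root"
    return info


def make_tarfile_python(folder_path: str, archive_path: str) -> None:
    # "w:bz2" writes a bzip2-compressed tar, like the -cjf flags of tar
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(folder_path, filter=reproducible_filter)
```

Since Python 3.7, TarFile.add recurses into directories in sorted order, which plays the role of --sort=name here.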
Upvotes: 1
Reputation: 560
Martin's answer is correct, but in my case I wanted to ignore the last modified date of each file in the tar as well, so that even if a file was "modified" but with no actual changes, it still has the same hash.
When creating the tar, I can override values I don't care about so they are always the same.
The example below shows that with a normal tar.bz2, re-creating the source file with a new modification timestamp changes the hash (archives 1 and 2 match; archive 4, built after re-creation, differs). However, if I set the time to Unix epoch 0 (or any other fixed time), the archives all hash the same (3, 5 and 6).
To do this, pass a filter function to tar.add(DIR, filter=tarInfoStripFileAttrs) that overrides the desired fields, as in the example below:
import tarfile, time, os
def createTestFile():
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")
# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# intended to be passed as a filter to tarfile.add
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0
    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0
    # user/group info
    tarInfo.uid = 0
    tarInfo.uname = ''
    tarInfo.gid = 0
    tarInfo.gname = ''
    # stripping paxheaders may not be required
    # see https://stackoverflow.com/questions/34688392/paxheaders-in-tarball
    tarInfo.pax_headers = {}
    return tarInfo
# COMPRESSION_TYPE = "gz" # does not work even with filter
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)
createTestFile()
tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()
tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()
tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()
# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()
tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()
tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()
tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()
$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946 one.tar.bz2 # same as 2
0e51c97a8810e45b78baeb1677c3f946 two.tar.bz2 # same as 1
54a38d35d48d4aa1bd68e12cf7aee511 three.tar.bz2 # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd four.tar.bz2 # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 five.tar.bz2 # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 six.tar.bz2 # same as 3, even though timestamp has changed
You may want to tweak which params are modified and how in your filter function based on your use case.
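The "gz" mode most likely fails even with the filter because the gzip header itself carries a timestamp (and possibly a file name), which is what gzip -n leaves out. A possible way around this, sketched under that assumption, is to write the tar stream through gzip.GzipFile with mtime=0 and an empty filename. The names normalize and build_tar_gz and the default arcname "data" are hypothetical, chosen only for this illustration.

```python
import gzip
import io
import tarfile


def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    # Same idea as the filter above: pin the fields that vary between runs
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def build_tar_gz(src_dir: str, arcname: str = "data") -> bytes:
    buf = io.BytesIO()
    # filename="" and mtime=0 keep the name and timestamp out of the gzip header
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=0) as gz:
        # Plain "w" mode: tarfile writes an uncompressed tar into the gzip stream
        with tarfile.open(fileobj=gz, mode="w") as tar:
            tar.add(src_dir, arcname=arcname, filter=normalize)
    return buf.getvalue()
```

Writing the returned bytes to one.tar.gz and two.tar.gz should then give matching checksums, as long as the same Python and zlib versions produce both files.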
Upvotes: 7
Reputation: 13313
As a workaround, you can use bzip2 compression instead. It does not seem to have this problem, since the bzip2 stream header stores no timestamp or file name:
import tarfile
tar1 = tarfile.open("one.tar.bz2", "w:bz2")
tar1.add("bin")
tar1.close()
tar2 = tarfile.open("two.tar.bz2", "w:bz2")
tar2.add("bin")
tar2.close()
Running md5sum gives:
martin@martin-UX305UA:~/test$ md5sum one.tar.bz2 two.tar.bz2
e9ec2fd4fbdfae465d43b2f5ecaecd2f one.tar.bz2
e9ec2fd4fbdfae465d43b2f5ecaecd2f two.tar.bz2
Upvotes: 4