streaming md5sum of contents of a large remote tarball

Question

I need to fetch a .tar.gz archive from an HTTP server and compute the MD5 sum of each file it contains. Since the archive is 4.5 GB compressed and 12 GB decompressed, I'd like to do so without touching the hard drive. Of course, I can't keep everything in RAM either.

I'm trying to use Python for it, but my problem is that for some reason the tarfile module tries to call tell() and seek() on the input file handle, which is something you can't do with piped streams. Ideas?

import tarfile
import hashlib
import subprocess
URL = 'http://myhost/myfile.tar.gz'

url_fh = subprocess.Popen('curl %s | gzip -cd' % URL, shell=True, stdout=subprocess.PIPE)
tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
for tar_info in tar_fh:
    content_fh = tar_fh.extractfile(tar_info)
    print hashlib.md5(content_fh.read()).hexdigest(), tar_info.name
tar_fh.close()

The above fails with:

Traceback (most recent call last):
  File "gzip_pipe.py", line 13, in <module>
    tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
  File "/algo/algos2dev4/AlgoOne-EC/third-party-apps/python/lib/python2.6/tarfile.py", line 1644, in open
    saved_pos = fileobj.tell()
IOError: [Errno 29] Illegal seek

jfs · Accepted Answer

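To compute the MD5 sums of all files in a remote archive on the fly, open it in tarfile's stream mode ('r|*'), which reads the input sequentially without calling seek() or tell() and detects the compression automatically: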

#!/usr/bin/env python
import tarfile
import sys
import hashlib
from contextlib import closing
from functools import partial

try:
    from urllib.request import urlopen
except ImportError: # Python 2
    from urllib2 import urlopen

def md5sum(file, bufsize=1<<15):
    d = hashlib.md5()
    for buf in iter(partial(file.read, bufsize), b''):
        d.update(buf)
    return d.hexdigest()

url = sys.argv[1] # url to download
with closing(urlopen(url)) as r, tarfile.open(fileobj=r, mode='r|*') as archive:
    for member in archive:
        if member.isreg(): # extract only regular files from the archive
            with closing(archive.extractfile(member)) as file:
                print("{name}\t{sum}".format(name=member.name, sum=md5sum(file)))
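
To check the stream-mode behaviour without a network connection, here is a minimal sketch. It assumes a hypothetical PipeLike class that exposes only read() (standing in for a pipe or HTTP response) and a hypothetical helper build_sample_archive that creates a small .tar.gz in memory. Neither is part of tarfile; md5_of_members is an illustrative name as well.

```python
import hashlib
import io
import tarfile


class PipeLike(object):
    """Hypothetical stand-in for a pipe/socket: offers read() only, no seek() or tell()."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def build_sample_archive(files):
    """Hypothetical helper: return the bytes of a .tar.gz holding the given {name: payload} files."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode='w:gz') as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return raw.getvalue()


def md5_of_members(fileobj, bufsize=1 << 15):
    """Yield (name, md5 hexdigest) for every regular file in a tar stream."""
    # 'r|*' reads strictly forward, so fileobj only needs a read() method.
    with tarfile.open(fileobj=fileobj, mode='r|*') as archive:
        for member in archive:
            if not member.isreg():
                continue
            digest = hashlib.md5()
            extracted = archive.extractfile(member)
            # Hash in fixed-size chunks so a large member never sits in RAM.
            for chunk in iter(lambda: extracted.read(bufsize), b''):
                digest.update(chunk)
            yield member.name, digest.hexdigest()


sample = {'a.txt': b'hello\n', 'dir/b.bin': b'\x00' * 100000}
stream = PipeLike(build_sample_archive(sample))
results = dict(md5_of_members(stream))
for name in sorted(results):
    print("{0}\t{1}".format(name, results[name]))
```

The same function should also accept the stdout of the curl | gzip -cd pipeline from the question; since that data is already decompressed, mode='r|' (uncompressed stream) would be enough there. In stream mode each member has to be read in archive order, before moving on to the next one.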
