Reputation: 2692
I need to fetch a .tar.gz archive from an HTTP server and compute the MD5 sum of each file it contains. Since the archive is 4.5 GB compressed (12 GB decompressed), I'd like to do so without touching the hard drive. Of course I can't keep everything in RAM either.
I'm trying to use Python for it, but my problem is that for some weird reason the tarfile module tries to seek() on the input file handle, which is something you can't do with piped streams. Ideas?
import tarfile
import hashlib
import subprocess

URL = 'http://myhost/myfile.tar.gz'
url_fh = subprocess.Popen('curl %s | gzip -cd' % URL, shell=True, stdout=subprocess.PIPE)
tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
for tar_info in tar_fh:
    content_fh = tar_fh.extractfile(tar_info)
    print hashlib.md5(content_fh.read()).hexdigest(), tar_info.name
tar_fh.close()
The above fails with:
Traceback (most recent call last):
  File "gzip_pipe.py", line 13, in <module>
    tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
  File "/algo/algos2dev4/AlgoOne-EC/third-party-apps/python/lib/python2.6/tarfile.py", line 1644, in open
    saved_pos = fileobj.tell()
IOError: [Errno 29] Illegal seek
Upvotes: 3
Views: 623
Reputation: 414345
To find MD5 sums of all files in a remote archive on the fly:
#!/usr/bin/env python
import tarfile
import sys
import hashlib
from contextlib import closing
from functools import partial

try:
    from urllib.request import urlopen
except ImportError:  # Python 2
    from urllib2 import urlopen

def md5sum(file, bufsize=1 << 15):
    # feed the hash in bufsize chunks so memory use stays bounded
    d = hashlib.md5()
    for buf in iter(partial(file.read, bufsize), b''):
        d.update(buf)
    return d.hexdigest()

url = sys.argv[1]  # url to download
# mode='r|*' opens the tar as a forward-only stream (with transparent
# compression detection), so tarfile never calls tell() or seek()
with closing(urlopen(url)) as r, tarfile.open(fileobj=r, mode='r|*') as archive:
    for member in archive:
        if member.isreg():  # extract only regular files from the archive
            with closing(archive.extractfile(member)) as file:
                print("{name}\t{sum}".format(name=member.name, sum=md5sum(file)))
Upvotes: 3