Reputation: 13
I have two problems I'm trying to solve. 1) When I hash a file with hashlib's md5 in Python, I get a different output than when I run md5sum in bash. 2) The Python version takes much longer than bash.
Also, I have a table of md5sum values that I want to check this test file and others against. In my test case the bash output matches the value in the table, so ideally I would like the Python output to match it too.
Here's what I've tried so far:
import os
import hashlib
import gzip
import time
import subprocess

def get_gzip_md5(in_file):
    # gets the md5sum value for a given file
    hash_md5 = hashlib.md5()
    chunk = 8192
    with gzip.open(in_file, "rb") as f:
        while True:
            buffer = f.read(chunk)
            if not buffer:
                break
            hash_md5.update(buffer)
    return hash_md5.hexdigest()

t0 = time.process_time()
out = subprocess.run("md5sum test.fastq.gz", shell=True, stdout=subprocess.PIPE)
print(out.stdout)
t1 = time.process_time() - t0
print("Time elapsed:", t1)

t0 = time.process_time()
md5 = get_gzip_md5("test.fastq.gz")
print(md5)
t1 = time.process_time() - t0
print("Time elapsed:", t1)
Output:
b'b0a25d66a1a83582e088f062983128ed test.fastq.gz\n'
Time elapsed: 0.007306102000256942
cfda2978db7fab4c4c5a96c61c974563
Time elapsed: 95.02966231200026
Upvotes: 1
Views: 547
Reputation: 9267
The problem comes from how you opened your file in Python.
Short Answer:
You need to replace
gzip.open(in_file, "rb")
with
open(in_file, "rb")
and you'll get the same MD5 sum.
Long answer:
gzip.open() decompresses your .gz file and reads its decompressed contents in rb mode, whereas md5sum computes the MD5 sum of the compressed file exactly as it is stored on disk. The two tools therefore hash different bytes, which leads to different MD5 sum values.
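The difference can be reproduced on a small throwaway file. The following is only a sketch: the payload and the name demo.fastq.gz are made up for illustration and are not the asker's data.

```python
import gzip
import hashlib
import os
import tempfile

# Made-up FASTQ-like payload, used only for this illustration.
payload = b"@read1\nACGT\n+\nIIII\n" * 1000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fastq.gz")  # hypothetical file name
    with gzip.open(path, "wb") as f:
        f.write(payload)

    # Bytes as stored on disk: this is what md5sum reads.
    with open(path, "rb") as f:
        compressed_md5 = hashlib.md5(f.read()).hexdigest()

    # Bytes after decompression: this is what gzip.open() hands back.
    with gzip.open(path, "rb") as f:
        decompressed_md5 = hashlib.md5(f.read()).hexdigest()

print("compressed bytes:  ", compressed_md5)
print("decompressed bytes:", decompressed_md5)
print("original payload:  ", hashlib.md5(payload).hexdigest())
```

The digest of the decompressed bytes equals the digest of the original payload, while the digest of the compressed bytes is a different value.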
How do you solve this? Simply open the compressed file with open() in rb mode and compute its MD5 sum without decompressing it.
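A minimal sketch of that fix, assuming the goal is to reproduce md5sum's value: the function name file_md5 and the 1 MiB chunk size are my own choices, not part of the question.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file's bytes exactly as stored on disk."""
    digest = hashlib.md5()
    # Plain open() in binary mode: no decompression takes place.
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("test.fastq.gz") should then give the same value as md5sum. Skipping decompression should also remove most of the extra runtime. Note that time.process_time() only counts CPU time of the Python process itself, so the figure measured around the md5sum subprocess is not directly comparable; time.perf_counter() measures wall-clock time for both.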
Upvotes: 1