MGT

Reputation: 13

Python hashlib md5 different output and slow speed compared to bash md5sum?

I have two problems I'm trying to solve. 1) When I run hashlib's md5 I get a different output than when I run md5sum in bash. 2) Running the program in Python takes much longer than in bash.

Also, I have a table of md5sum values that I want to check this test file and others against. The bash output in my test case matches the value I was given in the table, so ideally I would like the Python output to match it as well.

Here's what I've tried so far:

import os
import hashlib
import gzip
import time
import subprocess

def get_gzip_md5(in_file):
    #gets the md5sum value for a given file
    
    hash_md5 = hashlib.md5()
    chunk = 8192
    
    with gzip.open(in_file, "rb") as f:
        
        while True:
            buffer = f.read(chunk)
            if not buffer:
                break
            hash_md5.update(buffer)

    return hash_md5.hexdigest()

t0 = time.process_time()

out = subprocess.run("md5sum test.fastq.gz", shell=True, stdout=subprocess.PIPE)
print(out.stdout)

t1 = time.process_time() - t0
print("Time elapsed:",t1)


t0 = time.process_time()

md5 = get_gzip_md5("test.fastq.gz")
print(md5)

t1 = time.process_time() - t0
print("Time elapsed:",t1)

Output:

b'b0a25d66a1a83582e088f062983128ed  test.fastq.gz\n'
Time elapsed: 0.007306102000256942
cfda2978db7fab4c4c5a96c61c974563
Time elapsed: 95.02966231200026

Upvotes: 1

Views: 547

Answers (1)

Chiheb Nexus

Reputation: 9267

The problem comes from how you opened your file in Python.

Short Answer:

You need to replace

gzip.open(in_file, "rb")

with

open(in_file, "rb")

and you'll get the same MD5 sum as md5sum.

Long answer:

gzip.open() decompresses your .gz file and reads its decompressed content in rb mode. md5sum, on the other hand, computes the MD5 sum of the compressed file as it is stored on disk. The two therefore hash different bytes, which leads to different MD5 values.
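To make the difference concrete, here is a small in-memory sketch. The payload is made-up FASTQ-like data (an assumption for illustration, not the contents of your file); it is compressed with gzip.compress() and then hashed both ways.

```python
import gzip
import hashlib

# Made-up FASTQ-like payload, used only for illustration.
payload = b"@read1\nACGT\n+\nIIII\n" * 100

# These are the bytes that would sit on disk in a .gz file;
# md5sum hashes bytes like these.
compressed = gzip.compress(payload)
md5_compressed = hashlib.md5(compressed).hexdigest()

# These are the bytes you get back after decompression;
# reading through gzip.open() yields bytes like these.
md5_uncompressed = hashlib.md5(gzip.decompress(compressed)).hexdigest()

# The two digests cover different byte sequences, so they differ.
print(md5_compressed == md5_uncompressed)  # prints False
```

The second digest is the hash of the original payload, not of the .gz container, which is why it cannot match what md5sum reports for the compressed file.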

How do you solve this? Simply open the compressed file with open() in rb mode and compute its MD5 sum without decompressing it.
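Put together, a minimal sketch of the corrected function could look like this. The name get_file_md5 and the 1 MiB default chunk size are my own choices, not something from your code, so adjust them as you see fit.

```python
import hashlib

def get_file_md5(in_file, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's raw bytes (no decompression)."""
    hash_md5 = hashlib.md5()
    # Plain open() in binary mode reads the compressed bytes as stored on
    # disk, which are the same bytes md5sum hashes.
    with open(in_file, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for buffer in iter(lambda: f.read(chunk_size), b""):
            hash_md5.update(buffer)
    return hash_md5.hexdigest()
```

You would call it as print(get_file_md5("test.fastq.gz")) and compare the result with the md5sum column of your table. Since this version no longer decompresses the data, it should also do less work per file, although I have not measured how much of your time difference that accounts for.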

Upvotes: 1
