Reputation: 697
I have a bizarre problem where the md5 hash I compute from a streamed file does not match md5sum. The weird thing is that if I read the file in and write it out to a second file, the Python md5 and md5sum second_file.txt agree. Here's the hash code:
import hashlib
import sys

file_hash = hashlib.md5()
with open(sys.argv[1], 'r') as f, open(sys.argv[2], 'w') as w:
    while True:
        c = f.read(1)
        w.write(c)
        file_hash.update(c.encode(encoding='utf-8'))
        if c == '':
            # end of file
            break
print(file_hash.hexdigest())
Both files are UTF-8 encoded, and I'm running this in a Docker container. I'm kind of at a loss here. Any ideas?
Upvotes: 0
Views: 1993
Reputation: 113930
Open the file in "rb" mode to get the raw bytes, and skip the encode step. By reading in text mode and re-encoding, you are effectively changing the bytes that md5 is hashing.
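A minimal sketch of that suggestion follows. The helper names md5_of_file and md5_of_text_mode and the chunk size are illustrative choices, not from the question. The second helper mimics the question's text-mode approach so the two can be compared. One possible cause of the mismatch, assuming the input contains Windows-style \r\n line endings (the question does not say), is that Python's text mode translates them to \n before the data is re-encoded.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the raw bytes stored in `path`."""
    file_hash = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns b'' at end of file
        for chunk in iter(lambda: f.read(chunk_size), b''):
            file_hash.update(chunk)
    return file_hash.hexdigest()


def md5_of_text_mode(path):
    """Mimic the question's approach: decode in text mode, then re-encode."""
    with open(path, 'r', encoding='utf-8') as f:
        # Text mode applies newline translation, so '\r\n' becomes '\n' here
        return hashlib.md5(f.read().encode('utf-8')).hexdigest()
```

For a file whose bytes survive the decode and re-encode round trip unchanged, both helpers return the same digest. Only md5_of_file is guaranteed to hash the same bytes that md5sum sees.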
Upvotes: 2
Reputation: 3294
In general, the problem could lie with either Python or the Linux md5sum command, so it would help if you provided the command line that produces the different hashes. In my experience, this most often happens when someone pipes from "echo" and forgets that "echo" appends a newline character to whatever it echoes.
For example, these DO NOT match:
>> echo 'thing' | md5sum
>> python -c "import hashlib;print(hashlib.md5(b'thing').hexdigest())"
Use "printf" to prevent the newline from being added. These DO match:
>> printf 'thing' | md5sum
>> python -c "import hashlib;print(hashlib.md5(b'thing').hexdigest())"
You can also place the data in a file:
>> printf 'thing' > temp
>> cat temp | md5sum
>> python -c "import hashlib;print(hashlib.md5(b'thing').hexdigest())"
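The same difference can be reproduced entirely in Python, which may make it easier to see. This sketch hashes the bytes that echo would emit (with the trailing newline) and the bytes that printf would emit (without it). The variable names are illustrative.

```python
import hashlib

# Bytes produced by: echo 'thing'  (echo appends a newline)
with_newline = hashlib.md5(b'thing\n').hexdigest()

# Bytes produced by: printf 'thing'  (no newline appended)
without_newline = hashlib.md5(b'thing').hexdigest()

# A single extra byte is enough to give a completely different digest
print(with_newline == without_newline)  # prints False
```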
Upvotes: 1