Reputation: 2177
I want Python to read to the EOF so I can get an appropriate hash, whether it is SHA1 or MD5. Please help. Here is what I have so far:
import hashlib
inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()
md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()
sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()
print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed
Upvotes: 208
Views: 262630
Reputation: 39065
If you don't need to support Python versions before 3.11, you can use hashlib.file_digest() like this:
import hashlib
def sha256sum(filename):
    with open(filename, 'rb', buffering=0) as f:
        return hashlib.file_digest(f, 'sha256').hexdigest()
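For example, assuming a file named data.bin exists (a placeholder name, not from the original answer), a quick sanity check could look like this:

print(sha256sum('data.bin'))  # prints the 64-character hex digest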
When using a Python 3 version older than 3.11, for the correct and efficient computation of the hash value of a file:
- Open the file in binary mode (i.e. add 'b' to the file mode) to avoid character encoding and line-ending conversion issues.
- Use readinto() to avoid buffer churning.
Example:
import hashlib
def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()
Note that the while loop uses an assignment expression, which isn't available in Python versions older than 3.8.
With older Python 3 versions you can use an equivalent variation:
import hashlib
def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
Upvotes: 183
Reputation: 3177
Starting with Python 3.11, you can use file_digest(), which takes responsibility for reading the file:
import hashlib
with open(inputFile, "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
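file_digest() returns a regular hash object, so you can get the hex string afterwards with, for example:

print(digest.hexdigest())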
Upvotes: 8
Reputation: 12572
TL;DR use buffers to not use tons of memory.
We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 gigs of RAM for a 2 gigabyte file, so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!
import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64KB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))
What we've done is we're updating our hashes of this bad boy in 64KB chunks as we go along with hashlib's handy dandy update method. This way we use a lot less memory than the 2GB it would take to hash the guy all at once!
You can test this with:
$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d bigfile
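(mkfile is macOS-specific; on Linux you could create an equivalent zero-filled 2 GB test file with GNU coreutils' truncate, a suggestion not from the original answer, which should yield the same digests since both files read as all zero bytes:)
$ truncate -s 2G bigfile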
Also, all of this is outlined in the related question: Get MD5 hash of big files in Python
Upvotes: 275
Reputation: 87
You do not need to define a function with 5-20 lines of code to do this! Save your time by using the pathlib and hashlib libraries; py_essentials is another solution, but third-party packages are best avoided.
from pathlib import Path
import hashlib
filepath = '/path/to/file'
filebytes = Path(filepath).read_bytes()
filehash_sha1 = hashlib.sha1(filebytes).hexdigest()  # hexdigest(), or you'd print the hash object itself
filehash_md5 = hashlib.md5(filebytes).hexdigest()
print(f'MD5: {filehash_md5}')
print(f'SHA1: {filehash_sha1}')
I used a few variables here to show the steps; you can easily inline them.
What do you think of the function below?
from pathlib import Path
import hashlib
def compute_filehash(filepath: str, hashtype: str) -> str:
    """Computes the requested hash for the given file.

    Args:
        filepath: The path to the file to compute the hash for.
        hashtype: The hash type to compute.
            Available hash types:
            md5, sha1, sha224, sha256, sha384, sha512, sha3_224,
            sha3_256, sha3_384, sha3_512

    Returns:
        A string that represents the hash.

    Raises:
        ValueError: If the hash type is not supported.
    """
    # shake_128/shake_256 are excluded: their hexdigest() requires an
    # explicit length argument, so the call below would fail for them.
    if hashtype not in ['md5', 'sha1', 'sha224', 'sha256', 'sha384',
                        'sha512', 'sha3_224', 'sha3_256', 'sha3_384',
                        'sha3_512']:
        raise ValueError(f'Hash type {hashtype} is not supported.')
    return getattr(hashlib, hashtype)(
        Path(filepath).read_bytes()).hexdigest()
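A hypothetical call, with the path being a placeholder:

print(compute_filehash('/path/to/file', 'sha256'))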
Upvotes: 4
Reputation: 574
FWIW, I prefer this version, which has the same memory and performance characteristics as maxschlepzig's answer but is more readable IMO:
import hashlib
def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    buffer = bytearray(bufsize)
    # using a memoryview so that we can slice the buffer without copying it
    buffer_view = memoryview(buffer)
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer_view)
            if not n:
                break
            h.update(buffer_view[:n])
    return h.hexdigest()
Upvotes: 3
Reputation: 133919
Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the file into memory.
import hashlib
import mmap
def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
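If you also need Windows, a sketch of a more portable variant would swap the POSIX-only prot argument for the cross-platform access parameter (note that mapping an empty file raises ValueError either way):

import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # access=ACCESS_READ works on both POSIX and Windows
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()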
Upvotes: 10
Reputation: 7030
I would propose simply:
import hashlib

def get_digest(file_path):
    h = hashlib.sha256()
    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
All the other answers here seem to overcomplicate things. Python already buffers reads (in a sensible way, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks the hash function finds ideal, which makes computing the hash faster or at least less CPU intensive. Instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
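For reference, block_size is the hash's internal block size in bytes; a quick check, with values from CPython's hashlib:

import hashlib
print(hashlib.md5().block_size)     # 64
print(hashlib.sha256().block_size)  # 64
print(hashlib.sha512().block_size)  # 128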
Upvotes: 42
Reputation: 55
import hashlib

user = input("Enter ")
h = hashlib.md5(user.encode())  # hashes the input string's bytes, not a file
h2 = h.hexdigest()
# write the hex digest to a file, then read it back and print it
with open("encrypted.txt", "w") as e:
    print(h2, file=e)
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)
Upvotes: -5
Reputation: 642
I have written a module which is able to hash big files with different algorithms.
pip3 install py_essentials
Use the module like this:
from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
Upvotes: 5