Reputation: 2217
I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content in the hashlib.md5()
function.
The problem is with very big files that their sizes could exceed the RAM size.
How can I get the MD5 hash of a file without loading the whole file into memory?
Upvotes: 217
Views: 160122
Reputation: 5113
As mentioned in @pseyfert's comment; in Python 3.11 and above, hashlib.file_digest()
can be used. While not explicitly documented, internally the function uses a chunking approach similar to the one in the accepted answer, as can be seen from its source code (lines 230–236).
The function also provides a keyword-only argument _bufsize
with a default value of 2^18 = 262,144 bytes that controls the buffer size for chunking; however, given its leading underscore and missing documentation, it should probably rather be considered an implementation detail.
In any case, the following code equivalently reproduces the accepted answer in Python 3.11+ (apart from the different chunk size):
import hashlib
with open("your_filename.txt", "rb") as f:
file_hash = hashlib.file_digest(f, "md5") # or `hashlib.md5` as 2nd arg
print(file_hash.digest())
print(file_hash.hexdigest()) # to get a printable str instead of bytes
Upvotes: 0
Reputation: 26512
import hashlib
def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
h = hash_factory()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
h.update(chunk)
return h.digest()
import hashlib
def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
h = hash_factory()
with open(filename,'rb') as f:
while chunk := f.read(chunk_num_blocks*h.block_size):
h.update(chunk)
return h.digest()
If you want a more Pythonic (no while True
) way of reading the file, check this code:
import hashlib
def checksum_md5(filename):
md5 = hashlib.md5()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
md5.update(chunk)
return md5.digest()
Note that the iter()
function needs an empty byte string for the returned iterator to halt at EOF, since read()
returns b''
(not just ''
).
Upvotes: 126
Reputation: 2516
I don't like loops. Based on Nathan Feger's answer:
md5 = hashlib.md5()
with open(filename, 'rb') as f:
functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()
Upvotes: 0
Reputation: 6331
I think the following code is more Pythonic:
from hashlib import md5
def get_md5(fname):
m = md5()
with open(fname, 'rb') as fp:
for chunk in fp:
m.update(chunk)
return m.hexdigest()
Upvotes: 1
Reputation: 22942
A Python 2/3 portable solution
To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll sum bytes values:
To be Python 2.7 and Python 3 portable, you ought to use the io
packages, like this:
import hashlib
import io
def md5sum(src):
md5 = hashlib.md5()
with io.open(src, mode="rb") as fd:
content = fd.read()
md5.update(content)
return md5
If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:
def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
md5 = hashlib.md5()
with io.open(src, mode="rb") as fd:
for chunk in iter(lambda: fd.read(length), b''):
md5.update(chunk)
return md5
The trick here is to use the iter()
function with a sentinel (the empty string).
The iterator created in this case will call o [the lambda function] with no arguments for each call to its
next()
method; if the value returned is equal to sentinel,StopIteration
will be raised, otherwise the value will be returned.
If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:
def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
calculated = 0
md5 = hashlib.md5()
with io.open(src, mode="rb") as fd:
for chunk in iter(lambda: fd.read(length), b''):
md5.update(chunk)
calculated += len(chunk)
callback(calculated)
return md5
Upvotes: 10
Reputation: 1503
Implementation of Yuval Adam's answer for Django:
import hashlib
from django.db import models
class MyModel(models.Model):
file = models.FileField() # Any field based on django.core.files.File
def get_hash(self):
hash = hashlib.md5()
for chunk in self.file.chunks(chunk_size=8192):
hash.update(chunk)
return hash.hexdigest()
Upvotes: 0
Reputation: 771
A remix of Bastien Semene's code that takes the Hawkwing comment about generic hashing function into consideration...
def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
"""
Block size directly depends on the block size of your filesystem
to avoid performances issues
Here I have blocks of 4096 octets (Default NTFS)
Linux Ext4 block size
sudo tune2fs -l /dev/sda5 | grep -i 'block size'
> Block size: 4096
Input:
path: a path
algorithm: an algorithm in hashlib.algorithms
ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
block_size: a multiple of 128 corresponding to the block size of your filesystem
human_readable: switch between digest() or hexdigest() output, default hexdigest()
Output:
hash
"""
if algorithm not in hashlib.algorithms:
raise NameError('The algorithm "{algorithm}" you specified is '
'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))
hash_algo = hashlib.new(algorithm) # According to hashlib documentation using new()
# will be slower then calling using named
# constructors, ex.: hashlib.md5()
with open(path, 'rb') as f:
for chunk in iter(lambda: f.read(block_size), b''):
hash_algo.update(chunk)
if human_readable:
file_hash = hash_algo.hexdigest()
else:
file_hash = hash_algo.digest()
return file_hash
Upvotes: 5
Reputation: 4559
I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz:
FileHash = hashlib.md5(FileData).hexdigest()
I couldn’t detect any noticeable performance difference with a range of file sizes 2 KB to 20 MB and therefore no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem was nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.
Upvotes: -2
Reputation: 607
Using multiple comment/answers for this question, here is my solution:
import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
'''
Block size directly depends on the block size of your filesystem
to avoid performances issues
Here I have blocks of 4096 octets (Default NTFS)
'''
md5 = hashlib.md5()
with open(path,'rb') as f:
for chunk in iter(lambda: f.read(block_size), b''):
md5.update(chunk)
if hr:
return md5.hexdigest()
return md5.digest()
Upvotes: 31
Reputation: 19496
Here's my version of Piotr Czapla's method:
def md5sum(filename):
md5 = hashlib.md5()
with open(filename, 'rb') as f:
for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
md5.update(chunk)
return md5.hexdigest()
Upvotes: 54
Reputation:
You need to read the file in chunks of suitable size:
def md5_for_file(f, block_size=2**20):
md5 = hashlib.md5()
while True:
data = f.read(block_size)
if not data:
break
md5.update(data)
return md5.digest()
Note: Make sure you open your file with the 'rb' to the open - otherwise you will get the wrong result.
So to do the whole lot in one method - use something like:
def generate_file_md5(rootdir, filename, blocksize=2**20):
m = hashlib.md5()
with open( os.path.join(rootdir, filename) , "rb" ) as f:
while True:
buf = f.read(blocksize)
if not buf:
break
m.update( buf )
return m.hexdigest()
The update above was based on the comments provided by Frerich Raabe - and I tested this and found it to be correct on my Python 2.7.2 Windows installation
I cross-checked the results using the jacksum tool.
jacksum -a md5 <filename>
Upvotes: 238
Reputation: 6492
You can't get its md5 without reading the full content. But you can use the update function to read the file's content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b).
Upvotes: 1
Reputation: 165172
Break the file into 8192-byte chunks (or some other multiple of 128 bytes) and feed them to MD5 consecutively using update()
.
This takes advantage of the fact that MD5 has 128-byte digest blocks (8192 is 128×64). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
In Python 3.8+ you can do
import hashlib
with open("your_filename.txt", "rb") as f:
file_hash = hashlib.md5()
while chunk := f.read(8192):
file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest()) # to get a printable str instead of bytes
Upvotes: 198
Reputation: 1
import hashlib,re
opened = open('/home/parrot/pass.txt','r')
opened = open.readlines()
for i in opened:
strip1 = i.strip('\n')
hash_object = hashlib.md5(strip1.encode())
hash2 = hash_object.hexdigest()
print hash2
Upvotes: -5