Reputation: 861
Updated Question
I know how to use Python to create an MD5 hash from a file (http://docs.python.org/3.5/library/hashlib.html#hash-algorithms). I also know how to read a text file line by line. However, my files can grow large, and it is inefficient to read the file twice from beginning to end. I wonder whether it is possible to read the data from disk only once and, like in a stream/pipe, intelligently combine the two tasks. Maybe something like:
The objective is to be more efficient by reading the (large) files from disk just once instead of twice, intelligently combining binary MD5 calculation and text-based processing on the same file.
I hope this explains it better. Thanks again for your help.
Juergen
Upvotes: 1
Views: 2746
Reputation: 861
This seems to work in Python 3.6:
#!/usr/bin/env python
import io
import hashlib

class MD5Pipe(io.BytesIO):
    # Feeds every block read through an MD5 hasher while passing the data on.
    def __init__(self, fd):
        self.fd = fd
        self.hasher = hashlib.md5()

    def readinto(self, b):
        l = self.fd.readinto(b)
        # print("readinto: ", l, len(b))
        if l > 0:
            self.hasher.update(b[0:l])
        return l

    def hexdigest(self):
        return self.hasher.hexdigest()

blocksize = 65536
file = "c:/temp/PIL/VTS/VTS_123.csv"
with open(file, "rb") as fd:
    with MD5Pipe(fd) as md5:
        with io.BufferedReader(md5) as br:
            with io.TextIOWrapper(br, newline='', encoding="utf-8") as reader:
                for line in reader:
                    print("line: ", line, end="")
                print("md5: ", md5.hexdigest())
Upvotes: 2
Reputation: 1121186
Yes, just create a single hashlib.md5() object and update it with each chunk:
md5sum = hashlib.md5()
buffer_size = 2048  # 2KB, adjust as needed.

with open(..., 'rb') as fileobj:
    # read the binary file in chunks
    for chunk in iter(lambda: fileobj.read(buffer_size), b''):
        # update the hash object
        md5sum.update(chunk)

# produce the final hash digest in hex.
print(md5sum.hexdigest())
If you need to also read the data as text, you'll have to write your own wrapper:

- either one that implements the TextIOBase API (implement all stub methods that relate to reading) and draws data from the BufferedReader object produced by the open(..., 'rb') call each time a line is requested. You'll have to do your own line splitting and decoding at that point;
- or one that implements the BufferedIOBase API (again, implement all stub methods) and is passed as the buffer to a TextIOWrapper class (see the sketch below).
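Here is a minimal sketch of the second option, assuming a UTF-8 encoded file; the class name HashingBuffer and the file name data.csv are placeholders, not part of the answer above:

import hashlib
import io

class HashingBuffer(io.BufferedIOBase):
    # Hypothetical wrapper: hashes every chunk the text layer pulls from the file.
    def __init__(self, raw):
        self.raw = raw
        self.hasher = hashlib.md5()

    def readable(self):
        return True

    def read(self, size=-1):
        data = self.raw.read(size)
        self.hasher.update(data)
        return data

    # TextIOWrapper prefers read1() when the buffer provides it.
    def read1(self, size=-1):
        return self.read(size)

with open("data.csv", "rb") as binary_file:
    buffer = HashingBuffer(binary_file)
    with io.TextIOWrapper(buffer, encoding="utf-8", newline='') as reader:
        for line in reader:
            pass  # line-by-line text processing goes here
        print("md5: ", buffer.hasher.hexdigest())

Note that the digest only covers the bytes actually read, so the whole file has to be iterated before calling hexdigest().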
Upvotes: 1