Reputation: 861
Updated Question
I know how to use Python to create an MD5 hash from a file (http://docs.python.org/3.5/library/hashlib.html#hash-algorithms). I also know how to read a text file line by line. However, my files can grow large, and it is inefficient to read the file twice from beginning to end. I wonder whether it is possible to read the data from disk only once and, like in a stream/pipe, intelligently combine the two tasks. Maybe something like:
The objective is to be more efficient by reading the (large) files from disk just once instead of twice, intelligently combining binary MD5 calculation and text-based processing on the same file.
I hope this explains it better. Thanks again for your help.
Juergen
Upvotes: 1
Views: 2746
Reputation: 861
This seems to work in Python 3.6:
#!/usr/bin/env python
import io
import hashlib

class MD5Pipe(io.BytesIO):
    # Feeds every block read through an MD5 hasher while passing the data on.
    def __init__(self, fd):
        self.fd = fd
        self.hasher = hashlib.md5()

    def readinto(self, b):
        l = self.fd.readinto(b)
        # print("readinto: ", l, len(b))
        if l > 0:
            self.hasher.update(b[0:l])
        return l

    def hexdigest(self):
        return self.hasher.hexdigest()

blocksize = 65536
file = "c:/temp/PIL/VTS/VTS_123.csv"
with open(file, "rb") as fd:
    with MD5Pipe(fd) as md5:
        with io.BufferedReader(md5) as br:
            with io.TextIOWrapper(br, newline='', encoding="utf-8") as reader:
                for line in reader:
                    print("line: ", line, end="")
                print("md5: ", md5.hexdigest())
Upvotes: 2
Reputation: 1121186
Yes, just create a single hashlib.md5() object and update it with each chunk:
md5sum = hashlib.md5()
buffer_size = 2048  # 2KB, adjust as needed.

with open(..., 'rb') as fileobj:
    # read the binary file in chunks
    for chunk in iter(lambda: fileobj.read(buffer_size), b''):
        # update the hash object
        md5sum.update(chunk)

# produce the final hash digest in hex.
print(md5sum.hexdigest())
If you need to also read the data as text, you'll have to write your own wrapper:

- either one that implements the TextIOBase API (implement all stub methods that relate to reading) and draws data from the BufferedReader object produced by the open(..., 'rb') call each time a line is requested. You'll have to do your own line splitting and decoding at that point;
- or one that implements the BufferedIOBase API (again, implement all stub methods) and is passed as the buffer to a TextIOWrapper class (see the sketch below).
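Here is a minimal sketch of the second option, assuming a UTF-8 encoded file; the class name HashingBuffer and the file name data.csv are placeholders, not part of the answer above:

import hashlib
import io

class HashingBuffer(io.BufferedIOBase):
    # Hypothetical wrapper: hashes every chunk the text layer pulls from the file.
    def __init__(self, raw):
        self.raw = raw
        self.hasher = hashlib.md5()

    def readable(self):
        return True

    def read(self, size=-1):
        data = self.raw.read(size)
        self.hasher.update(data)
        return data

    # TextIOWrapper prefers read1() when the buffer provides it.
    def read1(self, size=-1):
        return self.read(size)

with open("data.csv", "rb") as binary_file:
    buffer = HashingBuffer(binary_file)
    with io.TextIOWrapper(buffer, encoding="utf-8", newline='') as reader:
        for line in reader:
            pass  # line-by-line text processing goes here
        print("md5: ", buffer.hasher.hexdigest())

Note that the digest only covers the bytes actually read, so the whole file has to be iterated before calling hexdigest().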
Upvotes: 1