Reputation: 735
I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents. I want to use Python 3 to compute the hash, but my friend can't use Python 3 (because I'll wait till next year to send the file and by then Python 3 will be out of style, and he'll want to be using Python++ or whatever). All I can guarantee is that my friend will know how to compute the hash, in a mathematical sense---he might have to write his own code to run on his implementation of the MIX machine (which he will know how to do).
What hash do I use, and, more importantly, what do I take the hash of? For example, do I hash the str returned from a read on the file opened for reading as text? Do I hash some bytes-like object returned from a binary read? What if the file has weird end-of-line markers? Do I pad the tail end first so that the thing I am hashing is an appropriate size?
import hashlib
FILENAME = "filename"
# Now, what?
I say "sequence of bits" because not all computers are based on the 8-bit byte, and saying "sequence of bytes" is therefore too ambiguous. For example, GreenArrays, Inc. has designed a supercomputer on a chip, where each computer has 18-bit (eighteen-bit) words (when these words are used for encoding native instructions, they are composed of three 5-bit "bytes" and one 3-bit byte each). I also understand that before the 1970s, a variety of byte sizes were used. Although the 8-bit byte may be the most common choice, and may be optimal in some sense, the choice of 8 bits per byte is arbitrary.
Upvotes: 0
Views: 1345
Reputation: 735
I arrived at sha256hexdigestFromFile, an alternative to @Lincoln Yan's calculateSHA256Hash, after reviewing the standard for SHA-256.
This is also a response to my comment about 2048.
def sha256hexdigestFromFile(filePath, blocks = 1):
    '''Return as a str the SHA-256 message digest of contents of
    file at filePath.
    Reference: Introduction of NIST (2015) Secure Hash
    Standard (SHS), FIPS PUB 180-4. DOI:10.6028/NIST.FIPS.180-4
    '''
    assert isinstance(blocks, int) and 0 < blocks, \
        'The blocks argument must be an int greater than zero.'
    with open(filePath, 'rb') as MessageStream:
        from hashlib import sha256
        from functools import reduce
        def hashUpdated(Hash, MESSAGE_BLOCK):
            Hash.update(MESSAGE_BLOCK)
            return Hash
        def messageBlocks():
            'Return a generator over the blocks of the MessageStream.'
            WORD_SIZE, BLOCK_SIZE = 4, 512 # PER THE SHA-256 STANDARD
            BYTE_COUNT = WORD_SIZE * BLOCK_SIZE * blocks
            # Keep reading until read() returns b'' at end of file;
            # a single yield would hash only the first BYTE_COUNT bytes.
            yield from iter(lambda: MessageStream.read(BYTE_COUNT), b'')
        return reduce(hashUpdated, messageBlocks(), sha256()).hexdigest()
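On the "any number of bits" part of the question: FIPS 180-4 defines the padding for a message of any bit length, while hashlib only accepts whole bytes. Below is a minimal sketch of SHA-256 over a bit string, written from the standard's description. The names sha256_bits and bits_of, and the choice to represent the message as a str of '0' and '1' characters (most significant bit of each byte first), are my own assumptions for illustration. The constants are derived from the square and cube roots of the first primes rather than typed in.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32

def _first_primes(n):
    """Return the first n primes by trial division."""
    found, candidate = [], 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def _icbrt(n):
    """Return floor(n ** (1/3)) using integer binary search."""
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo

_PRIMES = _first_primes(64)
# First 32 fractional bits of the square roots of the first 8 primes.
H0 = [isqrt(p << 64) & MASK for p in _PRIMES[:8]]
# First 32 fractional bits of the cube roots of the first 64 primes.
K = [_icbrt(p << 96) & MASK for p in _PRIMES]

def _rotr(x, r):
    """Rotate the 32-bit word x right by r bits."""
    return ((x >> r) | (x << (32 - r))) & MASK

def bits_of(data):
    """Hypothetical helper: bytes -> '0'/'1' string, MSB of each byte first."""
    return ''.join(format(b, '08b') for b in data)

def sha256_bits(bits):
    """Return the SHA-256 hex digest of a message given as a '0'/'1' string."""
    if set(bits) - {'0', '1'}:
        raise ValueError("bits must contain only '0' and '1'")
    # Padding: a single 1 bit, zeros up to 448 mod 512, then the 64-bit length.
    padded = bits + '1'
    padded += '0' * ((448 - len(padded)) % 512)
    padded += format(len(bits), '064b')
    state = list(H0)
    for start in range(0, len(padded), 512):
        block = padded[start:start + 512]
        w = [int(block[i:i + 32], 2) for i in range(0, 512, 32)]
        for t in range(16, 64):  # message schedule
            s0 = _rotr(w[t - 15], 7) ^ _rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
            s1 = _rotr(w[t - 2], 17) ^ _rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
            w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
        a, b, c, d, e, f, g, h = state
        for t in range(64):  # compression rounds
            big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
            ch = (e & f) ^ (~e & g)
            t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
            big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
            maj = (a & b) ^ (a & c) ^ (b & c)
            t2 = (big_s0 + maj) & MASK
            a, b, c, d, e, f, g, h = ((t1 + t2) & MASK, a, b, c,
                                      (d + t1) & MASK, e, f, g)
        state = [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]
    return ''.join(format(s, '08x') for s in state)

# For whole-byte messages this should agree with hashlib.
print(sha256_bits(bits_of(b'abc')) == hashlib.sha256(b'abc').hexdigest())
# A 3-bit message, which hashlib cannot express directly.
print(sha256_bits('101'))
```

This only covers the hash itself. How a message whose length is not a multiple of eight bits is stored in a file and exchanged is a separate convention that both sides would have to agree on.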
Upvotes: 0
Reputation: 347
First of all, the hash() function in Python is not the same as a cryptographic hash function. Here are the differences:
hash()
A hash is a fixed-size integer that identifies a particular value. Each value needs to have its own hash, so for the same value you will get the same hash even if it's not the same object.
Note that the hash of a value only needs to be the same for one run of Python. Since Python 3.3, the hashes of str and bytes objects do in fact change for every new run of Python.
A cryptographic hash function (CHF) is a mathematical algorithm that maps data of an arbitrary size (often called the "message") to a bit array of a fixed size
It is deterministic, meaning that the same message always results in the same hash.
https://en.wikipedia.org/wiki/Cryptographic_hash_function
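To make the difference concrete, here is a small sketch. The helper name sha256_hex is my own; the expected digest is the well-known SHA-256 value for the three-byte message "abc". The built-in hash() result is printed without an expected value because it normally differs from one interpreter run to the next.

```python
import hashlib

def sha256_hex(message: bytes) -> str:
    """Return the SHA-256 digest of message as 64 hexadecimal characters."""
    return hashlib.sha256(message).hexdigest()

# Well-known SHA-256 digest of the message "abc".
ABC_DIGEST = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

# The cryptographic hash is the same on every machine and every run.
print(sha256_hex(b"abc") == ABC_DIGEST)

# The built-in hash of a str is salted per process, so this number
# usually changes between runs unless PYTHONHASHSEED is fixed.
print(hash("abc"))
```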
Now let's come back to your question:
I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents.
What you're looking for is one of the cryptographic hash functions. Typically, MD5, SHA-1, or SHA-256 is used to calculate a file hash. You want to open the file in binary mode, feed its raw bytes to the hash function, and finally take the digest encoded in hexadecimal form.
import hashlib
def calculateSHA256Hash(filePath):
    h = hashlib.sha256()
    with open(filePath, "rb") as f:
        data = f.read(2048)
        while data != b"":
            h.update(data)
            data = f.read(2048)
    return h.hexdigest()
print(calculateSHA256Hash(filePath = 'stackoverflow_hash.py'))
The above code takes itself as input, so it produces a SHA-256 hash of itself: 610e15155439c75f6b63cd084c6a235b42bb6a54950dcb8f2edab45d0280335e. This remains consistent as long as the code is not changed.
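On the question of weird end-of-line markers, here is a sketch of why binary mode is the safer choice. In text mode with the default newline handling, Python translates "\r\n" to "\n" while reading, so the str you get back is no longer the file's exact content. The file name crlf.txt and the sample content are made up for the demonstration.

```python
import hashlib
import os
import tempfile

def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()

RAW = b"line one\r\nline two\r\n"  # sample content with CRLF line endings

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "crlf.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(RAW)

    # Binary mode returns the stored bytes unchanged.
    with open(path, "rb") as f:
        binary_digest = sha256_of_bytes(f.read())

    # Text mode (default newline=None) turns "\r\n" into "\n" on read.
    with open(path, "r", encoding="ascii") as f:
        text = f.read()
    text_digest = sha256_of_bytes(text.encode("ascii"))

print(binary_digest == sha256_of_bytes(RAW))  # digest of the exact file bytes
print(text_digest == binary_digest)           # differs: the "\r" bytes were dropped
```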
Another example would be to hash a text file, test.txt, with content Helloworld.
This is done by simply changing the last line of the code to use "test.txt":
print(calculateSHA256Hash(filePath = 'test.txt'))
This gives a SHA-256 hash of 5ab92ff2e9e8e609398a36733c057e4903ac6643c646fbd9ab12d0f6234c8daf.
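For the receiving side, here is a sketch of how a friend using Python might check a file against the hash sent with it. The function names file_sha256 and matches_published_digest, the chunk size, and the file name received.bin are my own choices. The read chunk size is only a memory trade-off and does not affect the digest.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published_digest(path, published_hex):
    """Return True if the file's digest equals the published hex string.

    Surrounding whitespace and letter case in published_hex are ignored.
    """
    return file_sha256(path) == published_hex.strip().lower()

# Demonstration with a temporary file whose digest is well known ("abc").
ABC_DIGEST = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "received.bin")  # hypothetical file name
    with open(demo_path, "wb") as f:
        f.write(b"abc")
    print(matches_published_digest(demo_path, ABC_DIGEST))
    # Reading one byte at a time gives the same digest as large chunks.
    print(file_sha256(demo_path, 1) == file_sha256(demo_path))
```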
Upvotes: 5