Reputation: 497
I am trying to understand the SHA-2 algorithm, and it seems a bit vague how people encode the message 'L' (see Wikipedia's SHA-2 pseudocode). Is the message encoded in ASCII, UTF-8, or UTF-16? I understand that technically the message could be anything we decide on before hashing, but I want to check my little test program against sites like https://www.dcode.fr/sha256-hash, and I realize I can't check anything (except the empty "") without knowing whether we are appending the '1' and subsequent '0's to an 8-bit or a 16-bit representation of the message. If I use ASCII (which in this case is the same as UTF-8) for the word 'dcode', I expect the message to start with the following binary sequence:
d:01100100:UTF-8:100
c:01100011:UTF-8:99
o:01101111:UTF-8:111
d:01100100:UTF-8:100
e:01100101:UTF-8:101
0110010001100011011011110110010001100101
Can someone verify that I'm thinking about this correctly? As a side benefit, if you know of a standard that says the pre-hashed message should be UTF-8 or UTF-16 (presumably for specific applications), a pointer would be much appreciated.
This answer is close but lacks specificity: "How can I pad the message in SHA family"
Upvotes: 0
Views: 663
Reputation: 178179
The encoding doesn't matter to the algorithm. The "message L" is just a bunch of bits. Text can be encoded in any encoding you like. It is the bits of the final encoding that are processed by the SHA-256 algorithm, so you'll get different answers if the text is encoded in UTF-8 or UTF-16.
When you receive a message, the SHA256 can be validated and then the message can be decoded. The sender would have to tell you both the expected SHA256 and the encoding of the text.
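As a small illustration of that point, here is a sketch that prints the bit string for the question's word 'dcode' and compares digests across encodings. The helper name bit_string is made up for this example, and 'utf-16-be' (big-endian, no byte-order mark) is just one of several possible UTF-16 variants:

```python
import hashlib

def bit_string(data: bytes) -> str:
    """Render bytes as a string of 0s and 1s, most significant bit first."""
    return ''.join(f'{byte:08b}' for byte in data)

text = 'dcode'
ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-be')  # assumption: big-endian, no BOM

# For plain ASCII characters, ASCII and UTF-8 produce identical bytes.
print(bit_string(ascii_bytes))
print(ascii_bytes == utf8_bytes)

# UTF-16 uses two bytes per character here, so SHA-256 sees different input.
print(len(utf8_bytes), len(utf16_bytes))
print(hashlib.sha256(utf8_bytes).hexdigest())
print(hashlib.sha256(utf16_bytes).hexdigest())
```

The first line printed should match the 40-bit sequence in the question, and the two digests should differ because the input bytes differ.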
FYI, the site linked [1] used the ASCII values of the characters to generate the hash listed. Make sure to use "dCode", not "dcode" as in the question. Python code below:
>>> import hashlib
>>> hashlib.sha256('dCode'.encode('ascii')).hexdigest()
'254cd63ece8595b5c503783d596803f1552e0733d02fe4080b217eadb17711dd'
As far as padding is concerned, the message (in this case "dCode") is five bytes (40 bits). According to Wikipedia SHA-256:
Pre-processing (Padding):
begin with the original message of length L bits
append a single '1' bit
append K '0' bits, where K is the minimum number >= 0 such that (L + 1 + K + 64) is a multiple of 512
append L as a 64-bit big-endian integer, making the total post-processed length a multiple of 512 bits such that the bits in the message are: <original message of length L> 1 <K zeros> <L as 64 bit integer> , (the number of bits will be a multiple of 512)
So "dCode" is 5 bytes (40 bits). At least 9 more bytes must be added: the '1' bit plus 7 of the K zero bits make 1 byte, and the 64-bit big-endian value of L takes 8 bytes. That makes 14 bytes. 64 bytes are needed for the total length to be a multiple of 512 bits, so 50 more zero bytes must be added before the final 8-byte length. In Python that would be:
>>> def preprocess(msg):
...     # original message length in bits
...     L = len(msg) * 8
...     # append another byte binary 10000000 + the 8-byte big-endian L
...     msg += b'\x80' + L.to_bytes(8,'big')
...     n = len(msg) * 8 # new total length
...     if n % 512 != 0: # if not modulo 512
...         n = 512 - n % 512 # how many more bits needed
...         n //= 8 # convert to bytes
...         # Python magic to insert n bytes in the right place
...         msg = msg[:L // 8 + 1] + b'\x00' * n + msg[L // 8 + 1:]
...     return msg
...
>>> preprocess(b'dCode').hex()
'64436f64658000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000028'
>>> len(preprocess(b'dCode')) # in bytes
64
Caveat: The above algorithm assumes messages that are multiples of 8 bits in size (byte-oriented messages), but SHA-256 supports any bit length.
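For arbitrary bit lengths, the number of zero bits K from the quoted padding rule can be computed directly. This is only a sketch of the arithmetic, not a padding implementation; the function names zero_pad_bits and padded_length_bits are hypothetical:

```python
def zero_pad_bits(length_bits: int) -> int:
    """Smallest K >= 0 such that (L + 1 + K + 64) is a multiple of 512."""
    return -(length_bits + 1 + 64) % 512

def padded_length_bits(length_bits: int) -> int:
    """Total length after appending the '1' bit, K zeros and the 64-bit L."""
    return length_bits + 1 + zero_pad_bits(length_bits) + 64

# 40 bits is "dCode"; 447 and 448 sit on either side of the block boundary.
for L in (0, 40, 440, 447, 448):
    print(L, zero_pad_bits(L), padded_length_bits(L))
```

For L = 40 this gives K = 407, which is the 7 zero bits that complete the 0x80 byte plus 50 zero bytes (400 bits). A 448-bit message no longer fits in one block with its padding, so it is padded out to 1024 bits.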
[1] SHA-256 on dCode.fr [online website], retrieved on 2022-06-12, https://www.dcode.fr/sha256-hash
Upvotes: 2