SHA256 different values for same String

Question

I am generating the SHA256 of the following string

{
    "billerid": "MAHA00000MUM01",
    "authenticators": 
    [
        {
            "parameter_name": "CA Number",
            "value": "210000336768"
        }
    ],
    "customer": 
    {
        "firstname": "ABC",
        "lastname": "XYZ",
        "mobile": "9344895862",
        "mobile_alt": "9859585525",
        "email": "abc@billdesk.com",
        "email_alt": "abc2@billdesk.com",
        "pan": "BZABC1234L",
        "aadhaar": "123123123123"
    },
    "metadata": 
    {
        "agent": 
        {
            "agentid": "DC01DC31MOB528199558"
        },
        "device": 
        {
            "init_channel": "Mobile",
            "ip": "124.124.1.1",
            "imei": "490154203237518",
            "os": "Android",
            "app": "AGENTAPP"
        }
    },
    "risk":
    [
        {
          "score_provider": "DC31",
          "score_value": "030",
          "score_type": "TXNRISK"
        },
        {
          "score_provider": "BBPS",
          "score_value": "030",
          "score_type": "TXNRISK"
        }
    ]
}

I am getting different SHA256 output from different sources. This website (https://www.freeformatter.com/sha256-generator.html#ad-output) calculates the SHA256 of the above string as 053353867b8171a8949065500d7313c69fe7517c9d69eaff11164c35fcb14457

This website (https://emn178.github.io/online-tools/sha256.html) gives the SHA256 as eae5c26759881d48a194a6b82a9d542485d6b6ce96297275c136b1fa6712f253

I am using the CryptoJS library in JavaScript to calculate the SHA256, which also gives eae5c26759881d48a194a6b82a9d542485d6b6ce96297275c136b1fa6712f253.

I want the SHA256 calculated to be: 053353867b8171a8949065500d7313c69fe7517c9d69eaff11164c35fcb14457

Why is there a difference in the SHA256 calculated in different places?

Maarten Bodewes · Accepted Answer

The problem that you are experiencing is due to encoding differences. There are several reasons why encoding of the same string may produce different results:

different line endings (CR/LF for Windows, LF for Linux, CR for classic MacOS);
other differences in whitespace (tab or spaces, whitespace in line endings);
different character encodings (Windows-1252, UTF-8 and UTF-16 or internal character representation within language implementations);
the presence of meta information (presence of a Byte Order Mark);
different ways of handling special characters within an encoding (a character followed by a combining tilde versus the single precomposed character that already includes the tilde; see Unicode equivalence).
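To make the list above concrete, here is a minimal Python sketch that hashes the "same" text under several of these variations. The sample text and the helper name sha256_hex are made up for illustration; only hashlib and unicodedata from the standard library are used.

```python
import hashlib
import unicodedata


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw bytes as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample text with LF line endings and four-space indentation.
text = '{\n    "firstname": "ABC"\n}'

# Each entry looks identical in most editors but is a different byte sequence.
variants = {
    "LF, UTF-8": text.encode("utf-8"),
    "CRLF, UTF-8": text.replace("\n", "\r\n").encode("utf-8"),
    "LF, UTF-8 with BOM": text.encode("utf-8-sig"),
    "LF, UTF-16LE": text.encode("utf-16-le"),
    "tab indentation, UTF-8": text.replace("    ", "\t").encode("utf-8"),
}

for label, data in variants.items():
    print(f"{label:24} {len(data):3d} bytes  {sha256_hex(data)}")

# Unicode equivalence: precomposed U+00F1 versus "n" + combining tilde U+0303.
composed = unicodedata.normalize("NFC", "\u00f1")
decomposed = unicodedata.normalize("NFD", "\u00f1")
print(len(composed), len(decomposed))  # 1 code point versus 2 code points
print(sha256_hex(composed.encode("utf-8")) == sha256_hex(decomposed.encode("utf-8")))
```

Every variant produces a different digest, because the hash function only ever sees the bytes, never the "string" as a person reads it.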

There are also possible invisible errors that may produce different results:

the presence of unprintable characters / control codes (a null value, 0x00, at the end of the string is probably the best example).

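Besides all of these differences, which may be present in any (structured) text, JSON data structures can also contain equivalent values that are written differently. Probably the best example is a number: 1.0 and 1.00 denote the same value, as do 1e2 and 1e+2 (strict JSON only permits the + sign in the exponent, not in front of the number itself, although lenient parsers may accept it). The extra characters are entirely spurious, but they still result in a different textual representation of an identical value.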

If the encoding of the string differs then the binary input of the hash algorithm differs, and you will get results that differ by about 50% of the bits for a common cryptographic hash. The way to produce the same input is called canonicalization (or C14N, as there are 14 characters between the C and N of canonicalization).
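The "about 50% of the bits" claim is easy to check. The following sketch counts how many of the 256 output bits differ between the SHA-256 digests of two inputs that differ only in their line endings; the sample input and the function name differing_bits are assumptions for illustration.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which the SHA-256 digests of a and b differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")


# Same text, once with LF and once with CRLF line endings.
lf = b'{\n"billerid": "MAHA00000MUM01"\n}'
crlf = lf.replace(b"\n", b"\r\n")

print(differing_bits(lf, crlf), "of 256 bits differ")
```

The count should land somewhere around 128, which is why the two digests look completely unrelated even though the inputs differ by only two bytes.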

For XML a canonical form was defined long ago. For JSON this is not the case, even though canonicalization of JSON would be much easier; JSON has a much less convoluted set of rules, after all. There are attempts to canonicalize JSON; see e.g. this draft RFC, which explicitly mentions cryptographic hashes:

For example when a cryptographic hash is applied over a JSON document, a single physical representation allows the hash to represent the logical content of the document by removing variation in how that content is encoded in JSON.

This draft RFC looks a bit more thorough, by the way.

For now you could keep to one of the draft RFCs. If you want to keep the newlines in the JSON itself, you could parse it, serialize it again using these well-defined rules, and use that serialization as input to the hash function, while leaving the original JSON untouched. That way differently formatted JSON would still generate the same hash.

[Input JSON] -> (parse) -> (canonicalize & serialize) -> (hash) -> [hash value]
[Input JSON'] -> (parse) -> (canonicalize & serialize) -> (hash) -> [hash value']

Here the hash output would be identical if the Input JSON and Input JSON' are structurally / semantically the same, as the canonicalization would smooth out the differences.
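The pipeline above could be sketched in Python as follows. This is not an implementation of either draft RFC: it simply uses the standard json module with sorted keys and no insignificant whitespace as a stand-in canonical form, and details such as number formatting and key ordering for non-ASCII keys may differ from what the drafts prescribe. The function name canonical_json_hash and the sample documents are made up for illustration.

```python
import hashlib
import json


def canonical_json_hash(document: str) -> str:
    """Parse JSON, re-serialize it deterministically, and hash the UTF-8 bytes."""
    value = json.loads(document)                     # (parse)
    canonical = json.dumps(                          # (canonicalize & serialize)
        value,
        sort_keys=True,          # fixed member order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit characters directly, then encode as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # (hash)


# Two hypothetical inputs: same content, different formatting and key order.
pretty = '{\r\n    "billerid": "MAHA00000MUM01",\r\n    "risk": [ { "score_value": "030" } ]\r\n}'
compact = '{"risk":[{"score_value":"030"}],"billerid":"MAHA00000MUM01"}'

print(canonical_json_hash(pretty) == canonical_json_hash(compact))
```

Both sides have to apply exactly the same serialization rules for this to work, which is the point of agreeing on a published canonical form rather than on the defaults of one library.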

Note that JSON Web Signatures (JWS) sidestep this issue, even though signatures use a hash internally. The signature is calculated over an included payload, and the encoding of that payload is simply used as is. This is fine as long as an intermediate system doesn't re-encode the JSON. Signatures do not have to be identical; they just need to verify the data.

Unfortunately, that's not the case for hashes. However, in practice, you could define the JSON as a file and use the same reasoning. The drawback is of course that if you get a difference you will have to perform a binary compare to find the differences and then trace back where the change was introduced. Working systems may break the hash while the semantics are still the same (e.g. when replacing or updating a JSON library).
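For the binary compare mentioned above, a small helper that reports the offset of the first differing byte is often enough to spot the culprit. This is a minimal sketch; the function names and the sample byte strings are hypothetical.

```python
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if a == b."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a prefix of the other; they differ where the shorter ends.
        return min(len(a), len(b))
    return None


def show_context(data: bytes, offset: int, width: int = 8) -> str:
    """Show the bytes around an offset with escapes visible (\\r, \\n, \\x00, ...)."""
    start = max(0, offset - width)
    return repr(data[start:offset + width])


# Hypothetical example: the sender used LF, the receiver ended up with CRLF.
sent = b'{\n  "os": "Android"\n}'
received = b'{\r\n  "os": "Android"\r\n}'

pos = first_difference(sent, received)
if pos is not None:
    print("first difference at byte", pos)
    print("sent:    ", show_context(sent, pos))
    print("received:", show_context(received, pos))
```

Printing the surrounding bytes with repr makes carriage returns, byte order marks and trailing null bytes visible, which are exactly the differences that are invisible in a normal text view.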
