Razican

Reputation: 727

SHA-1 shows different output in UTF-8 Java

I have written a SHA-1 function that usually gives the same output as PHP's sha1 function. But when non-ASCII characters appear, the results differ. For example, with the string "hj6¬", in PHP I get "7f9d591232c5fde9f757c4d8472921517991dc3c" while my Java function gives "c963b7df20488e9ef50c1a309c1fa747ab5d8822". Here is the Java function:

https://github.com/Razican/Java-Utils/blob/master/src/razican/utils/StringUtils.java#L115

Which one is the correct one? How can I implement it in Java?

Upvotes: 0

Views: 889

Answers (1)

McDowell

Reputation: 108879

The correct output is 7f9d591232c5fde9f757c4d8472921517991dc3c. You are dropping a byte:

        final MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(str.getBytes("UTF-8"), 0, str.length());
        sha1hash = md.digest();

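The above code assumes that the length of the UTF-16 string (str.length(), counted in chars) equals the length of the UTF-8 encoded byte array. If the UTF-8 form is longer than the UTF-16 form, the trailing bytes are left out and the digest will be incorrect. Here the string has 4 chars but encodes to 5 bytes: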

codepoint   glyph   escaped    UTF-8           info
=======================================================================
U+0068      h       \u0068     68,             BASIC_LATIN, LOWERCASE_LETTER
U+006a      j       \u006a     6a,             BASIC_LATIN, LOWERCASE_LETTER
U+0036      6       \u0036     36,             BASIC_LATIN, DECIMAL_DIGIT_NUMBER
U+00ac      ¬       \u00ac     c2,ac,          LATIN_1_SUPPLEMENT, MATH_SYMBOL
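
To make the mismatch concrete, here is a minimal sketch that prints the UTF-8 bytes of each code point and compares the two lengths. The class name Utf8Lengths and the helper utf8Hex are hypothetical names chosen for illustration; they are not part of the asker's library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that char count and UTF-8 byte count can differ.
final class Utf8Lengths {

    // Returns the UTF-8 encoding of one code point as comma-separated hex bytes.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            // Mask with 0xff so negative bytes print as two unsigned hex digits.
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "hj6\u00ac"; // \u00ac is the NOT SIGN from the question
        str.codePoints().forEach(cp ->
                System.out.printf("U+%04x  %s%n", cp, utf8Hex(cp)));
        System.out.println("chars: " + str.length());
        System.out.println("UTF-8 bytes: " + str.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Because str.length() is 4 and the encoded array holds 5 bytes, passing str.length() to md.update leaves the final byte (ac) out of the digest.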

To fix it, use the length of the byte array instead:

        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
        md.update(utf8, 0, utf8.length);

You could also use md.update(str.getBytes(StandardCharsets.UTF_8)), which digests the whole array.
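
Putting it together, a self-contained version might look like the sketch below. The class name Sha1Utf8 and method name sha1Hex are hypothetical and do not reflect the actual signature in the linked StringUtils.java; the lowercase hex formatting is an assumption chosen to match the output style of PHP's sha1.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: SHA-1 over the UTF-8 bytes of a string, as lowercase hex.
final class Sha1Utf8 {

    static String sha1Hex(String str) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            // Digest the whole encoded array, so no length argument can get out of sync.
            byte[] digest = md.digest(str.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask with 0xff so negative bytes print as two unsigned hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1, so this is not expected.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // \u00ac avoids depending on the source file's encoding.
        System.out.println(sha1Hex("hj6\u00ac"));
    }
}
```

With the full byte array digested, the result for "hj6¬" should agree with the PHP value quoted above.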

Upvotes: 1
