dinesh707

Reputation: 12582

Why does a new String created with UTF-8 contain more bytes?

byte bytes[] = new byte[16];
random.nextBytes(bytes);
try {
   return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
   log.warn("Hash generation failed", e);
}

When I generate a String with the given method and then call string.getBytes().length, it returns a different value. The maximum I've seen was 32. Why does a 16-byte array end up producing a string with a different number of bytes?

But if I do string.length(), it returns 16.
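
For reference, a self-contained version of what I'm running (the class name is just for illustration; getBytes() uses the platform default charset, as in my code above):

import java.util.Random;

public class RandomStringLength {
    public static void main(String[] args) throws Exception {
        Random random = new Random();
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);

        String s = new String(bytes, "UTF-8");
        System.out.println("s.length()          = " + s.length());
        System.out.println("s.getBytes().length = " + s.getBytes().length);
    }
}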

Upvotes: 3

Views: 5376

Answers (6)

Alex Salauyou

Reputation: 14338

This is because your bytes are first converted to a Unicode string: the constructor attempts to decode a UTF-8 character sequence from them. If a byte cannot be treated as an ASCII character, nor combined with the following byte(s) to form a legal Unicode character, it is replaced by "�". Such a character is encoded as 3 bytes when calling String#getBytes(), thus adding 2 extra bytes to the resulting output for every byte it replaces.

If you're lucky enough to generate ASCII characters only, String#getBytes() will return a 16-byte array; if not, the resulting array may be longer. For example, the following code snippet:

byte[] b = new byte[16];
Arrays.fill(b, (byte) 190); // 190 = 0xBE, never valid as a stand-alone UTF-8 byte
b = new String(b, "UTF-8").getBytes("UTF-8");

results in an array 48(!) bytes long.
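
The 3-byte figure is easy to verify directly (a minimal check; StandardCharsets requires Java 7+):

import java.nio.charset.StandardCharsets;

public class ReplacementCharSize {
    public static void main(String[] args) {
        // U+FFFD REPLACEMENT CHARACTER encodes as EF BF BD in UTF-8
        System.out.println("\uFFFD".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}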

Upvotes: 6

SubOptimal

Reputation: 22973

The generated bytes might form valid multi-byte characters.

Take this as an example. The string contains only one character, but its byte representation takes three bytes.

String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));

String.length() returns the length of the string in characters. Ω is a single character, but it is 3 bytes long when encoded in UTF-8.

If you change your code like this:

Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());

The same bytes are interpreted with a different charset. And, following the Javadoc of String(byte[] bytes, String charsetName):

The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
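
To make this concrete with fixed instead of random bytes, here is a small sketch (the byte values are hand-picked for illustration):

import java.nio.charset.StandardCharsets;

public class CharsetLength {
    public static void main(String[] args) {
        // 0xCE 0xA9 is the UTF-8 encoding of 'Ω'; 0xBE alone is not valid UTF-8
        byte[] bytes = { (byte) 0xCE, (byte) 0xA9, (byte) 0xBE, (byte) 'A' };

        // UTF-8: 'Ω' + replacement character + 'A' -> 3 characters
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length());
        // ISO-8859-1: every byte maps to exactly one character -> 4 characters
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length());
    }
}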

Upvotes: 3

kuporific

Reputation: 10323

If you look at the string you're producing, most of the random bytes you're generating do not form valid UTF-8 sequences. The String constructor therefore replaces them with the Unicode REPLACEMENT CHARACTER � (U+FFFD), which takes up 3 bytes when encoded in UTF-8.

As an example:

public static void main(String[] args) throws UnsupportedEncodingException
{
    Random random = new Random();

    byte bytes[] = new byte[16];
    random.nextBytes(bytes);
    printBytes(bytes);

    final String s = new String(bytes, "UTF-8");
    System.out.println(s);
    printCharacters(s);
}

private static void printBytes(byte[] bytes)
{
    for (byte aByte : bytes)
    {
        System.out.print(
                Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
    }
    System.out.println();
}

private static void printCharacters(String s)
{
    s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}

On a given run, I got this output:

30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c 
0A��2�8��#JT&͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE

Upvotes: 1

Joop Eggen

Reputation: 109547

This will try to create a String assuming the bytes are valid UTF-8:

new String(bytes, "UTF-8");

In general this will go horribly wrong, as random bytes frequently form invalid UTF-8 multi-byte sequences.

Like:

String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8); // -128 = 0x80, a lone continuation byte, decoded as U+FFFD

The second step:

byte[] bytes = s.getBytes();

will use the platform default encoding (System.getProperty("file.encoding")). Better to specify the charset explicitly:

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

One should realize that internally a String holds Unicode text as an array of 16-bit chars in UTF-16.

One should entirely abstain from using String to carry a byte[]. It always involves a conversion, costs double the memory and is error prone.
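
If the goal is simply a printable string representing 16 random bytes (e.g. a token or hash), an encoding designed for arbitrary bytes is the usual alternative; a sketch using java.util.Base64 (Java 8+), assuming that is what the string is ultimately for:

import java.util.Base64;
import java.util.Random;

public class RandomToken {
    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        new Random().nextBytes(bytes);

        // Base64 maps every 3 input bytes to 4 ASCII characters; nothing is lost or replaced
        String token = Base64.getEncoder().encodeToString(bytes);
        System.out.println(token);          // a 24-character Base64 string
        System.out.println(token.length()); // always 24 for 16 input bytes
    }
}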

Upvotes: 0

fge

Reputation: 121702

A classic mistake born from a misunderstanding of the relationship between bytes and chars, so here we go again.

There is no 1-to-1 mapping between byte and char; it all depends on the character encoding you use (in Java, that is a Charset).

Worse: a given byte sequence may or may not be decodable into a char sequence at all.

Try this for instance:

final byte[] buf = new byte[16];
new Random().nextBytes(buf);

final Charset utf8 = StandardCharsets.UTF_8;
final CharsetDecoder decoder = utf8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

decoder.decode(ByteBuffer.wrap(buf)); // throws CharacterCodingException if the bytes are not valid UTF-8

This is very likely to throw a MalformedInputException.

I know this is not exactly an answer, but then you didn't clearly explain your problem; and the example above already shows that you have a wrong understanding of the difference between a byte and a char.

Upvotes: 3

Subhan

Reputation: 1634

String.getBytes().length is likely to be larger, as it counts the bytes needed to represent the string in the given encoding, while length() counts 2-byte (UTF-16) code units.

read more here
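
For example, a character outside the Basic Multilingual Plane shows the difference clearly (a small sketch; the musical symbol is just an illustrative choice):

import java.nio.charset.StandardCharsets;

public class CodeUnitsVsBytes {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, a single code point

        System.out.println(s.codePointCount(0, s.length()));           // 1 code point
        System.out.println(s.length());                                // 2 UTF-16 code units
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
    }
}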

Upvotes: 0
