user5381685

UTF-8 encoded string's byte count isn't as expected

I don't understand this: why does the following code print 12 rather than 11, although "hello world" has only 11 characters?

byte[] byteArray = Charset.forName("UTF-8").encode("hello world").array();
System.out.println(byteArray.length);

Upvotes: 2

Views: 373

Answers (4)

TuringTux

Reputation: 579

Using this program, you can figure out what bytes the byte array contains:

byte[] byteArray = Charset.forName("UTF-8").encode("hello world").array();
for(int i = 0; i < byteArray.length; i++) {
    System.out.println(byteArray[i]+" - "+((char)byteArray[i]));
}

The bytes are (decimal):

104 101 108 108  111 32 119 111  114 108 100 0

The first 11 bytes are the UTF-8 encoding of hello world, as expected. The last byte is the null character (0): it is unused space in the buffer's backing array, not part of the encoded string.

To deal with this, use the .limit() method of ByteBuffer, as mentioned in another answer.

Upvotes: 0

rmuller

Reputation: 12849

This is easy to see if you dump the array:

b=68, char=h
b=65, char=e
b=6C, char=l
b=6C, char=l
b=6F, char=o
b=20, char= 
b=77, char=w
b=6F, char=o
b=72, char=r
b=6C, char=l
b=64, char=d
b=0, char=

So the last character is \u0000.
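
A dump like the one above could be produced along these lines. This is only a sketch: the class name HexDump and the helper describe are made up for illustration, and the number of trailing zero bytes depends on how large a backing array the encoder happens to allocate.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class HexDump {
    // Formats one byte as an uppercase hex value plus the character it maps to.
    // (Hypothetical helper, not part of any library.)
    static String describe(byte b) {
        return String.format("b=%X, char=%c", b, (char) b);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode("hello world");
        // array() exposes the whole backing array, including any unused tail.
        for (byte b : buffer.array()) {
            System.out.println(describe(b));
        }
    }
}
```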

Upvotes: 3

Francisco C.

Reputation: 747

I'm not sure what you are trying to accomplish, but to get the byte array of a string, why not just use:

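String s = "hello world";
// StandardCharsets (java.nio.charset, Java 7+) avoids the checked UnsupportedEncodingException.
byte[] b = s.getBytes(StandardCharsets.UTF_8);

// Holds here because every character is ASCII, i.e. one byte each in UTF-8.
assertEquals(s.length(), b.length);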

More information in this answer:

How to convert Strings to and from UTF8 byte arrays in Java
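
Note that the character count and the byte count only match for pure ASCII text. The following sketch (the class name Utf8Length and the helper utf8ByteCount are hypothetical) compares the two for an ASCII string and for one with two accented letters, which take two bytes each in UTF-8:

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hello world";
        // Same text with e-acute and o-umlaut, written as Unicode escapes.
        String accented = "h\u00e9llo w\u00f6rld";
        System.out.println(ascii.length() + " chars, " + utf8ByteCount(ascii) + " bytes");
        System.out.println(accented.length() + " chars, " + utf8ByteCount(accented) + " bytes");
    }
}
```

The first line reports 11 chars and 11 bytes, the second 11 chars and 13 bytes.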

Upvotes: 1

fgb

Reputation: 18569

The array() method of ByteBuffer returns the whole array backing the buffer, and not all of its bytes are significant: only the bytes up to the buffer's limit hold encoded data. The following prints 11, as expected:

int limit = Charset.forName("UTF-8").encode("hello world").limit();
System.out.println(limit);
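
If you need the bytes themselves rather than just the count, one option is to copy only the valid region out of the buffer. A minimal sketch, assuming a freshly encoded buffer (position 0); the class name EncodedBytes and the helper toExactBytes are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class EncodedBytes {
    // Copies only the bytes between position and limit, ignoring any
    // unused tail of the backing array. (Hypothetical helper.)
    static byte[] toExactBytes(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode("hello world");
        // Backing array length: may be larger than the encoded length.
        System.out.println(buffer.array().length);
        // Number of valid bytes in the buffer.
        System.out.println(buffer.limit());
        // Length of the exact copy.
        System.out.println(toExactBytes("hello world").length);
    }
}
```

Using remaining() and get() rather than array() also works for buffers that have no accessible backing array.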

Upvotes: 7
