John B

Reputation: 107

Creating a String from a byte array does not return the same length

I have this problem: my method receives a String that must fit in a database column limited to 200 (VARCHAR). With certain characters, even though the length of the String is less than 200, the length in bytes is apparently more than 200, so I tried the following:

Get the byte length of the String:

byte[] nameBytes = name.getBytes("UTF-8");

Then, if nameBytes.length > 200, I try to create a new String from a subarray of the original nameBytes, like this:

name = new String(Arrays.copyOfRange(nameBytes, 0, 200), "UTF-8");

I am sure that Arrays.copyOfRange(nameBytes, 0, 200) returns an array of length 200, but for some reason, once I create the new String, checking name.getBytes("UTF-8").length gives me 201, so I don't know why one more byte is being added.

Is there something I am doing wrong? Or is there a way to be sure of creating an array of the same length as the char array?

Thanks in advance.

Upvotes: 3

Views: 1456

Answers (1)

atao

Reputation: 845

First, some examples:



        // requires: import java.nio.charset.Charset;
        String name = "façade";

        System.out.println(String.format("String '%s': %d", name, name.length()));
        // For each charset: number of encoded bytes / number of chars after decoding back
        for (String cs : new String[] {"UTF-8", "UTF-16", "UTF-16BE"}) {
            // Passing a Charset (not a charset name) in both directions avoids
            // the checked UnsupportedEncodingException of new String(byte[], String)
            Charset charset = Charset.forName(cs);
            byte[] nameBytes = name.getBytes(charset);
            System.out.println(String.format("%s: %d / %d", cs, nameBytes.length, new String(nameBytes, charset).length()));
        }

with the output:



    String 'façade': 6  ---> 6 characters with one outside ASCII range
    UTF-8: 7 / 6 ---> 'ç' requires 2 bytes, the others only one
    UTF-16: 14 / 6 ---> 2 x 6 bytes for code points + 2 bytes for BOM
    UTF-16BE: 12 / 6 ---> no need to embed the BOM here => 2 x 6 bytes are enough

Comments:

  • always specify a charset, in both directions (when encoding to bytes and when decoding back to a String)
  • about the BOM, see Byte order mark
  • quoting Unicode Character Representations: "The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities."

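The issue here is the charset used in your database. If it's UTF-8, then you have to check character by character until you hit the 200-byte limit. With UTF-8, you can't cut the string at an arbitrary byte offset: the cut can fall in the middle of a multi-byte character. When that happens, new String(...) replaces the incomplete sequence with the replacement character U+FFFD, which itself takes 3 bytes in UTF-8, so the re-encoded string can end up longer than the array you passed in.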
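
The character-by-character check described above can be sketched as follows. This is only a minimal illustration, assuming the column limit is counted in UTF-8 bytes; the class name Utf8Truncate and the helper truncateUtf8 are hypothetical names introduced here, not part of the original code. It walks the string one code point at a time, adds up the UTF-8 size of each, and stops before the limit would be exceeded.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Truncate {

    // Hypothetical helper: returns the longest prefix of s whose UTF-8
    // encoding fits in maxBytes, without ever splitting a code point.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 size of one code point: 1 to 4 bytes.
            // A lone surrogate is counted as 3 bytes, which is conservative.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            // Advance by 1 char, or 2 for a surrogate pair
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String name = "fa\u00e7ade"; // "façade", 7 bytes in UTF-8
        String cut = truncateUtf8(name, 3);
        // A 3-byte limit would split the 2-byte 'ç', so it is dropped entirely
        System.out.println(cut + ": " + cut.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

In the question's case the call would be something like name = truncateUtf8(name, 200). Whether the database limit is really counted in bytes or in characters depends on the database and its column definition, so that is worth checking first.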

Upvotes: 1
