Joker

Reputation: 11146

String.getBytes() returns array of Unicode chars

I was reading about getBytes(), and the documentation states that it returns the resultant byte array.

But when I ran the following program, I found that it returns an array of Unicode symbols.

public class GetBytesExample {
    public static void main(String[] args) {
        String str = "A";
        byte[] array1 = str.getBytes();
        System.out.print("Default Charset encoding:");
        for (byte b : array1) {
            System.out.print(b);
        }

    }
}

The above program prints the following output:

Default Charset encoding:65

This 65 is equivalent to Unicode representation of A. My question is that where are the bytes whose return type is expected.

Upvotes: 2

Views: 1442

Answers (3)

Stephen C

Reputation: 718788

This 65 is equivalent to Unicode representation of A

It is also equivalent to a UTF-8 representation of A

It is also equivalent to an ASCII representation of A

It is also equivalent to an ISO/IEC 8859-1 representation of A

It so happens that the encoding for A is the same in a lot of character encodings, and that these all match the Unicode code point. This is not a coincidence. It is a result of the history of character set and character encoding standards.


My question is that where are the bytes whose return type is expected.

In the byte array, of course :-)

You are (just) misinterpreting them.

When you do this:

    for (byte b : array1) {
        System.out.print(b);
    }

you output a series of bytes as decimal numbers with no spaces between them. This is consistent with the way that Java distinguishes between text / character data and binary data. Bytes are binary. The getBytes() method gives a binary encoding (in some character set) of the text in the string. You are then formatting and printing the binary (one byte at a time) as decimal numbers.

If you want more evidence of this, replace the "A" literal with a literal containing (say) some Chinese characters. Or any Unicode characters greater than \u00ff ... expressed using \u syntax.
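As a rough sketch of that experiment, the program below encodes "A" and a single Chinese character with explicitly named charsets and prints the bytes in hex. The class name EncodingDemo and the helper toHex are made up for illustration, and the charsets are passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: formats each byte as two hex digits, space separated.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = "A";
        String chinese = "\u4e2d"; // one Chinese character, code point U+4E2D

        // "A" is a single byte, 0x41 (65 decimal), in UTF-8.
        System.out.println(toHex(ascii.getBytes(StandardCharsets.UTF_8)));      // 41

        // The Chinese character needs three bytes in UTF-8 and two in UTF-16BE.
        System.out.println(toHex(chinese.getBytes(StandardCharsets.UTF_8)));    // e4 b8 ad
        System.out.println(toHex(chinese.getBytes(StandardCharsets.UTF_16BE))); // 4e 2d
    }
}
```

Only the UTF-16BE bytes happen to spell out the code point here; the UTF-8 bytes look nothing like 4E2D, which shows that getBytes() really is returning an encoding rather than Unicode values.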

Upvotes: 2

Henry

Reputation: 43728

String.getBytes() returns the encoding of the string using the platform encoding. The result depends on which machine you run this on. If the platform encoding is UTF-8, or ASCII, or ISO-8859-1, or a few others, an 'A' will be encoded as 65 (aka 0x41).
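To check this on a given machine, something like the following might help. It is only a sketch: the class name DefaultCharsetDemo and the helper encode are invented for illustration. It prints the platform default and then encodes with explicitly named charsets, which removes the dependence on the machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // Hypothetical helper: encodes a string with an explicitly chosen charset.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        // The charset that the no-argument getBytes() uses on this machine.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // 'A' encodes to the single byte 65 in all three of these charsets.
        System.out.println(Arrays.toString(encode("A", StandardCharsets.US_ASCII)));   // [65]
        System.out.println(Arrays.toString(encode("A", StandardCharsets.ISO_8859_1))); // [65]
        System.out.println(Arrays.toString(encode("A", StandardCharsets.UTF_8)));      // [65]

        // A character outside ASCII shows the difference between encodings.
        String eAcute = "\u00e9"; // e with acute accent, code point U+00E9
        System.out.println(Arrays.toString(encode(eAcute, StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(Arrays.toString(encode(eAcute, StandardCharsets.UTF_8)));      // [-61, -87]
    }
}
```

The negative numbers appear because Java bytes are signed; -23 is the same bit pattern as 0xE9.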

Upvotes: 1

Andy Turner

Reputation: 140318

There is no PrintStream.print(byte) overload, so the byte needs to be widened to invoke the method.

Per JLS 5.1.2:

19 specific conversions on primitive types are called the widening primitive conversions:

  • byte to short, int, long, float, or double
  • ...

There's no PrintStream.print(short) overload either.

The next most-specific one is PrintStream.print(int). So that's the one that's invoked, hence you are seeing the numeric value of the byte.
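The same overload resolution can be seen in a small sketch. The describe methods below are hypothetical stand-ins for the print overloads (there is deliberately no byte or short version, mirroring PrintStream), and the class name PrintOverloadDemo is made up for illustration.

```java
public class PrintOverloadDemo {
    // Stand-ins for PrintStream.print(char), print(int) and print(long).
    static String describe(char c) { return "char"; }
    static String describe(int i)  { return "int"; }
    static String describe(long l) { return "long"; }

    public static void main(String[] args) {
        byte b = 65;

        // byte widens to int and long but not to char, and int is the
        // most specific applicable overload.
        System.out.println(describe(b));        // int

        // An explicit cast selects the char overload instead.
        System.out.println(describe((char) b)); // char

        // The real PrintStream methods behave the same way.
        System.out.println(b);                  // 65
        System.out.println((char) b);           // A
    }
}
```

So if the goal is to see the character rather than the number, casting each byte to char works for single-byte values like this one, although decoding the whole array with new String(bytes, charset) is the more general approach.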

Upvotes: 5
