Manuel Carrasco

Reputation: 77

Is String.getBytes being safely used?

I currently need to work with the bytes of a String in Java, and this has raised many questions about encodings and JVM implementation details. I would like to know whether what I'm doing makes sense or whether it is redundant.

To begin with, I understand that at runtime a Java char in a String will always represent a symbol in Unicode.

Secondly, the UTF-8 encoding can always successfully encode any symbol in Unicode. Therefore, the following snippet will always return a byte[] without doing any replacement (see the String.getBytes documentation).

byte[] stringBytes = myString.getBytes(StandardCharsets.UTF_8);

Then, if stringBytes is used in a different JVM instance in the following way, it will always yield a string equivalent to myString.

new String(stringBytes, StandardCharsets.UTF_8);
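As a minimal sketch of the round trip described above, the two calls can be combined and checked against a well-formed string. The class name `Utf8RoundTrip` and the sample string are hypothetical and only serve the illustration; both steps run in the same JVM here, which is enough to exercise the encode/decode logic.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question.
class Utf8RoundTrip {
    // Encodes to UTF-8 and decodes back, exactly as in the two snippets above.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample: ASCII 'a', U+00E9, and U+1F600 written as a
        // proper high+low surrogate pair.
        String wellFormed = "a\u00E9\uD83D\uDE00";
        System.out.println(wellFormed.equals(roundTrip(wellFormed)));
    }
}
```

This only shows the behaviour for a string whose surrogates are correctly paired; it says nothing about strings that are not valid UTF-16.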

Do you think that my understanding of getBytes is correct? If that is the case, how would you justify it? Am I missing a pathological case which could lead me not to get an equivalent version of myString?

Thanks in advance.


EDIT:

Would you agree that, by doing the following, any non-exceptional flow leads to a handled case, which allows us to successfully reconstruct the string?


EDIT:

Based on the answers, here is a solution that lets you safely reconstruct the string whenever no exception is thrown. You still need to handle the exception somehow.

First, get the bytes using the encoder:

final CharsetEncoder encoder =
    StandardCharsets.UTF_8
        .newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .onMalformedInput(CodingErrorAction.REPORT);

// encode() throws a CharacterCodingException if the string is malformed or contains an unmappable character.
// The ByteBuffer's backing array can be larger than the encoded content (see the ByteBuffer docs),
// so copy only the remaining bytes instead of calling array().
ByteBuffer encoded = encoder.encode(CharBuffer.wrap(string));
byte[] stringBytes = new byte[encoded.remaining()];
encoded.get(stringBytes);

Second, construct the string using the bytes given by the encoder (non-exceptional path):

new String(stringBytes, StandardCharsets.UTF_8);
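For completeness, here is a self-contained sketch of the same idea with a strict decoder added on the receiving side. The class and method names (`StrictUtf8`, `encode`, `decode`) are hypothetical; the strict decoding step is an addition for illustration, not something the answers require. Note that `new String(bytes, StandardCharsets.UTF_8)` silently replaces malformed byte sequences, whereas a decoder configured with `CodingErrorAction.REPORT` throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, shown only as a sketch.
class StrictUtf8 {
    // Encodes s as UTF-8; throws if s contains malformed UTF-16
    // (for example an unpaired surrogate) instead of substituting '?'.
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(s));
        // Copy only the encoded bytes; the backing array may be larger.
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    // Decodes UTF-8 bytes; throws on malformed input instead of
    // substituting the replacement character.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }
}
```

With this shape, `StrictUtf8.decode(StrictUtf8.encode(s))` either returns a string equal to `s` or one of the two calls throws a `CharacterCodingException`.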

Upvotes: 4

Views: 912

Answers (1)

Sweeper

Reputation: 271175

it will always yield a string equivalent to myString.

Well, not always. Not a lot of things in this world happen always.

One edge case I can think of is that myString could be an "invalid" string when you call getBytes. For example, it could contain a lone (unpaired) surrogate:

String myString = "\uD83D";

How often this will happen heavily depends on what you are doing with myString, so I'll let you think about that on your own.

If myString contains a lone surrogate, getBytes encodes a question mark character in its place:

// prints "?"
System.out.println(
    new String(myString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
);

I wouldn't say a ? is "equivalent" to a malformed string.
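If you want to detect this case before encoding, one option is to scan the string for unpaired surrogates. The following is a minimal sketch; the class and method names (`SurrogateCheck`, `isWellFormedUtf16`) are made up for illustration, and it only relies on `Character.isHighSurrogate` and `Character.isLowSurrogate`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, shown only as a sketch.
class SurrogateCheck {
    // Returns true if every surrogate code unit in s belongs to a
    // high surrogate immediately followed by a low surrogate.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate with no partner
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String lone = "\uD83D";
        System.out.println(isWellFormedUtf16(lone));
        // getBytes substitutes a single '?' byte (0x3F) for the lone surrogate.
        System.out.println(lone.getBytes(StandardCharsets.UTF_8)[0] == 0x3F);
    }
}
```

A string that passes this check can be encoded with getBytes(StandardCharsets.UTF_8) without any substitution taking place.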

See also: Is an instance of a Java string always valid UTF-16?

Upvotes: 3
