Way to determine if a charset is multibyte?

Question

Is there a way to determine whether a given Charset (java.nio.charset.Charset) encodes characters using multiple bytes? Or, alternatively, is there a list somewhere of character sets that do/do not use more than one byte per character?

The reason I'm asking is a performance tweak: I need to know how long (in bytes) an arbitrary string is in a given character set. In the case of single-byte encodings, it's simply the length of the string. Knowing whether or not a charset is single-byte will save me from having to re-encode it first.

You might think that this is a puny optimization that couldn't possibly be worth the effort, but a lot of CPU cycles in my application are spent on this sort of nonsense, and the input data I've encountered so far has been in 20+ different charsets.

Jon Skeet · Accepted Answer

The simplest way is probably:

boolean multiByte = charset.newEncoder().maxBytesPerChar() > 1.0f;

Note that newEncoder can throw UnsupportedOperationException if the Charset doesn't support encoding. newDecoder isn't documented to throw that, but maxCharsPerByte isn't the appropriate measure here. You could use averageCharsPerByte instead: if that's 1, it's a pretty good indication of a single-byte encoding, but in theory a charset could have some bytes that produce multiple characters and some characters that take multiple bytes, averaging out at 1...
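To tie this back to the byte-length shortcut in the question, here is a minimal sketch rather than a definitive implementation. The class name CharsetWidth and the methods isMultiByte and encodedLength are hypothetical names chosen for illustration. It checks Charset.canEncode() up front so newEncoder() is never called on a decode-only charset, and it treats such charsets as multibyte, which is an assumption on my part. The shortcut also assumes the string contains no surrogate pairs, since a single-byte encoder would likely substitute one byte for the two chars of a pair.

```java
import java.nio.charset.Charset;

final class CharsetWidth {
    private CharsetWidth() {}

    /**
     * Returns true if the charset may need more than one byte per char.
     * Decode-only charsets are reported as multibyte (a conservative
     * assumption made for this sketch, not something the API dictates).
     */
    static boolean isMultiByte(Charset charset) {
        // newEncoder() throws UnsupportedOperationException when the
        // charset does not support encoding, so check canEncode() first.
        if (!charset.canEncode()) {
            return true;
        }
        return charset.newEncoder().maxBytesPerChar() > 1.0f;
    }

    /**
     * Byte length of s in the given charset. Takes the shortcut only for
     * single-byte charsets; otherwise it falls back to encoding the string.
     * Assumes the charset supports encoding and s has no surrogate pairs.
     */
    static int encodedLength(String s, Charset charset) {
        if (!isMultiByte(charset)) {
            return s.length();
        }
        return s.getBytes(charset).length;
    }
}
```

With 20+ charsets in play, the result of isMultiByte could be computed once per Charset and kept in a map, so that no encoder is created on the hot path.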
