Reputation: 631
Is there a way to determine whether a given Charset (java.nio.charset.Charset) encodes characters using multiple bytes? Or, alternatively, is there a list somewhere of character sets that do/do not use more than one byte per character?
The reason I'm asking is a performance tweak: I need to know how long (in bytes) an arbitrary string is in a given character set. In the case of single-byte encodings, it's simply the length of the string. Knowing whether or not a charset is single-byte will save me from having to re-encode it first.
You might think that this is a puny optimization that couldn't possibly be worth the effort, but a lot of CPU cycles in my application are spent on this sort of nonsense, and the input data I've encountered so far has been in 20+ different charsets.
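To make the intended fast path concrete, here is a minimal sketch of the kind of helper being described. The class and method names are my own illustration, not from the question; it assumes that when an encoder's maxBytesPerChar() is 1, every char encodes to exactly one byte (unmappable characters are replaced by a single-byte substitute by String.getBytes, so the count still matches):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLength {
    /**
     * Returns the encoded length of s in the given charset, skipping
     * the actual encoding when one char can never exceed one byte.
     */
    static int byteLength(String s, Charset cs) {
        // Fast path: a single-byte encoding means byte length == char length.
        // canEncode() guards against newEncoder() throwing
        // UnsupportedOperationException for decode-only charsets.
        if (cs.canEncode() && cs.newEncoder().maxBytesPerChar() == 1.0f) {
            return s.length();
        }
        // Slow path: actually encode the string.
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("héllo", StandardCharsets.ISO_8859_1)); // 5
        System.out.println(byteLength("héllo", StandardCharsets.UTF_8));      // 6
    }
}
```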
Upvotes: 2
Views: 1785
Reputation: 1502736
The simplest way is probably:
boolean multiByte = charset.newEncoder().maxBytesPerChar() > 1.0f;
Note that newEncoder can throw UnsupportedOperationException, though, if the Charset doesn't support encoding. While newDecoder isn't documented to throw that, the decoder's maxCharsPerByte isn't appropriate here. You could use averageCharsPerByte - if that's 1, it's a pretty good indication that it's a single-byte encoding, but in theory you could have some bytes which produce multiple characters and some characters which take multiple bytes, averaging out at 1...
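Putting the check and the caveat together, a small sketch (the helper name is mine, not from the answer) might guard with canEncode() so decode-only charsets never reach newEncoder():

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MultiByteCheck {
    /**
     * True if the charset may use more than one byte per char.
     * Charsets that don't support encoding are conservatively
     * treated as multi-byte, since newEncoder() would throw
     * UnsupportedOperationException for them.
     */
    static boolean isMultiByte(Charset cs) {
        if (!cs.canEncode()) {
            return true;
        }
        return cs.newEncoder().maxBytesPerChar() > 1.0f;
    }

    public static void main(String[] args) {
        System.out.println(isMultiByte(StandardCharsets.US_ASCII)); // false
        System.out.println(isMultiByte(StandardCharsets.UTF_8));    // true
    }
}
```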
Upvotes: 6