Daniel Lucraft

Reputation: 7356

How to convert a UTF-8 byteOffset into a charOffset for a Java String?

I have a byte offset into a byte array containing a UTF-8 encoded string. How can I convert it into a char offset for the corresponding Java String?

NB. this question used to read:

I have a byte offset into a standard Java String, and I would like to convert that to a character offset.

In practice this will mean a method like charOffsetBefore(int byteOffset) since any byte offset could be in the middle of a code point.

Thanks.

Upvotes: 5

Views: 1346

Answers (2)

John Allen

Reputation: 159

I would suggest that you do not have a byte offset into a standard Java String. If you really do, can you tell us how you got it (code please)?

Upvotes: 1

Aaron Digulla

Reputation: 328564

Please be extremely careful with your terminology, otherwise you'll get confused. There is no such thing as a "byte offset into a Java string". Java strings are made up of 16-bit chars (UTF-16 code units).

So I assume that you have a byte array and an offset and you want to convert that into a Java string and still preserve locations (so you can map back and forth).

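This depends on the encoding of the byte array. If it's UTF-8, then any byte that has its MSB set is part of a multi-byte encoding sequence. A byte with the MSB clear is a single-byte (ASCII) character and is its own start. Continuation bytes satisfy (b & 0xc0) == 0x80, so starting at your offset, search backwards for the first byte for which (b & 0xc0) == 0xc0. That's the start of the encoding sequence (see the Wikipedia article).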
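
Here is a rough sketch of how that backwards search could be turned into the charOffsetBefore method from the question. The class name Utf8Offsets and the extra byte[] parameter are my own additions for illustration, and the code assumes the array holds well-formed UTF-8 (it does no validation).

```java
import java.nio.charset.StandardCharsets;

final class Utf8Offsets {

    /**
     * Returns the char (UTF-16 code unit) offset of the code point that
     * contains byteOffset. Assumes utf8 is well-formed UTF-8.
     */
    static int charOffsetBefore(byte[] utf8, int byteOffset) {
        if (byteOffset < 0 || byteOffset > utf8.length) {
            throw new IndexOutOfBoundsException("byteOffset: " + byteOffset);
        }
        // Step back over continuation bytes (10xxxxxx) to the start of the sequence.
        int start = byteOffset;
        while (start > 0 && start < utf8.length && (utf8[start] & 0xC0) == 0x80) {
            start--;
        }
        // Count the chars produced by the bytes before that start.
        int chars = 0;
        for (int i = 0; i < start; i++) {
            int b = utf8[i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                chars++;          // ASCII byte or lead byte: begins a code point
            }
            if (b >= 0xF0) {
                chars++;          // 4-byte sequence: decodes to a surrogate pair (2 chars)
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        // 'a', e-acute (2 bytes), euro sign (3 bytes), an emoji (4 bytes), 'b'
        byte[] bytes = "a\u00e9\u20ac\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        // Byte 8 is in the middle of the emoji, which starts at char index 3.
        System.out.println(charOffsetBefore(bytes, 8));
    }
}
```

Counting is linear in the offset. If you need many lookups on the same array, building the offset table once is probably the better trade-off.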

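If you're asking about offsets within the Java string itself, then the encoding is UTF-16 and you need to look for surrogate pairs: a code point above U+FFFF takes two chars, so a char offset can land in the middle of one.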
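
For the UTF-16 side, something along these lines should snap a char index back to the start of its code point. The class and method names are made up for this example. Character.isHighSurrogate and Character.isLowSurrogate are standard JDK methods.

```java
final class SurrogateOffsets {

    /**
     * If charIndex points at the low (second) half of a surrogate pair,
     * returns the index of the high half; otherwise returns charIndex unchanged.
     */
    static int codePointStart(String s, int charIndex) {
        if (charIndex > 0 && charIndex < s.length()
                && Character.isLowSurrogate(s.charAt(charIndex))
                && Character.isHighSurrogate(s.charAt(charIndex - 1))) {
            return charIndex - 1;
        }
        return charIndex;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";   // 'a', one emoji (two chars), 'b'
        // Index 2 is the low surrogate, so the code point starts at index 1.
        System.out.println(codePointStart(s, 2));
    }
}
```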

Upvotes: 3
