jonmorrey76

Reputation: 59

Mystery UTF-8-like encoding

I've been given a file that is supposedly UTF-8, but some of the non-English characters have odd encodings. For example, in this mystery encoding, the Hangul string

한국경북영덕군강구면

is encoded as:

0xED959C 0xEAB5AD 0xEAB2BD 0xEBB63F 0xEC983F 0xEB3F95 0xEAB5B0 0xEAB095 0xEAB5AC 0xEBA9B4

(the differences are in the 4th, 5th, and 6th byte groups) rather than the standard UTF-8:

0xED959C 0xEAB5AD 0xEAB2BD 0xEBB681 0xEC9881 0xEB8D95 0xEAB5B0 0xEAB095 0xEAB5AC 0xEBA9B4

I'm seeing the same phenomenon with Cyrillic and Chinese characters: some characters have the same encoding as UTF-8, but others are different. The garbled characters have the same byte width as the non-garbled ones, and I've verified they aren't part of an extension set. I've also verified this is not Java "Modified UTF-8".

Any other ideas as to what this may be?

BTW: I don't have access to the code or people who originally wrote the file.

Also, I'm on Mac 10.11.6 in case that has anything to do with it.

Upvotes: 2

Views: 230

Answers (1)

ruakh

Reputation: 183211

Your example string consists of UTF-8, but with certain byte values (namely 0x81 and 0x8D) replaced with the ASCII question mark ? (0x3F). The only plausible explanation is that your example string has passed through a piece of software that tried to interpret its contents according to some other encoding (probably a single-byte character set), and that replaced "invalid" characters with ? (analogously to how a Unicode text processor might replace invalid Unicode characters with U+FFFD).
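
A quick way to see this (a minimal Python sketch, not from your file or its producer) is to compare the two byte sequences position by position; every difference turns out to be a 0x81 or 0x8D that has become 0x3F:

    # Compare the "mystery" bytes with the expected UTF-8 bytes for the same string.
    good = bytes.fromhex("ED959C EAB5AD EAB2BD EBB681 EC9881 EB8D95 EAB5B0 EAB095 EAB5AC EBA9B4")
    bad  = bytes.fromhex("ED959C EAB5AD EAB2BD EBB63F EC983F EB3F95 EAB5B0 EAB095 EAB5AC EBA9B4")

    # Every differing position holds '?' (0x3F) in the damaged data.
    assert all(b == 0x3F for g, b in zip(good, bad) if g != b)

    # The set of original byte values that were clobbered.
    print(sorted({hex(g) for g, b in zip(good, bad) if g != b}))  # ['0x81', '0x8d']
    print(good.decode("utf-8"))  # 한국경북영덕군강구면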

Unfortunately, that process is not really reversible, since at least two distinct byte values (and probably more that don't happen to appear in your example) got replaced, so there's no guaranteed way to identify the original byte value in every case. Depending on how important this is — that is, depending on how much time it's worth spending on it — you could potentially identify the complete set of bytes that got replaced, and then write something that tries each possible value for each byte, comparing the resulting character sequences with (say) bigram frequencies from some corpus of text in the relevant language, and selecting the most probable byte. (Of course, it will make some mistakes. To estimate the resulting error rate, you can try the same process on a known text.)
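
If you do want to attempt that repair, one way to set it up (a rough sketch under my own assumptions; the candidate byte range and the score function are placeholders, not anything taken from your file) is to enumerate substitutions for each '?' slot, discard the ones that don't decode as valid UTF-8, and keep the highest-scoring candidate:

    import itertools

    # Assumption: every clobbered byte was a UTF-8 continuation byte (0x80-0xBF).
    CANDIDATES = bytes(range(0x80, 0xC0))

    def candidate_decodings(damaged: bytes):
        """Yield every string obtainable by substituting a candidate byte for each '?'."""
        slots = [i for i, b in enumerate(damaged) if b == 0x3F]
        for combo in itertools.product(CANDIDATES, repeat=len(slots)):
            trial = bytearray(damaged)
            for pos, value in zip(slots, combo):
                trial[pos] = value
            try:
                yield trial.decode("utf-8")
            except UnicodeDecodeError:
                continue  # not valid UTF-8 with this substitution; discard

    def score(text: str) -> float:
        """Placeholder: return a bigram log-probability from a corpus in the relevant language."""
        return 0.0

    # The three damaged characters from your example.
    damaged = bytes.fromhex("EBB63F EC983F EB3F95")
    best = max(candidate_decodings(damaged), key=score)
    print(best)

Note that this sketch treats every 0x3F byte as damage; a legitimate question mark in the text would need to be distinguished, for example by only treating as damaged those 0x3F bytes that fall where a UTF-8 continuation byte is expected.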

Upvotes: 3
