jonmorrey76

Reputation: 59

Mystery UTF-8-like encoding

I've been given a file that is supposedly UTF-8, but some of the non-English characters have odd encodings. For example, in this mystery encoding, the Hangul string

한국경북영덕군강구면

is encoded as:

0xED959C 0xEAB5AD 0xEAB2BD 0xEBB63F 0xEC983F 0xEB3F95 0xEAB5B0 0xEAB095 0xEAB5AC 0xEBA9B4

(the differences are in the 4th, 5th, and 6th byte groups) rather than the standard UTF-8:

0xED959C 0xEAB5AD 0xEAB2BD 0xEBB681 0xEC9881 0xEB8D95 0xEAB5B0 0xEAB095 0xEAB5AC 0xEBA9B4

I'm seeing the same phenomenon with Cyrillic and Chinese characters: some characters have the same encoding as UTF-8, but others are different. The garbled characters have the same byte width as the non-garbled ones, and I've verified they aren't part of an extension set. I've also verified this is not Java "Modified UTF-8".

Any other ideas as to what this may be?

BTW: I don't have access to the code or people who originally wrote the file.

Also, I'm on Mac 10.11.6 in case that has anything to do with it.

Upvotes: 2

Views: 230

Answers (1)

ruakh

Reputation: 183211

Your example string consists of UTF-8, but with certain byte values (namely 0x81 and 0x8D) replaced with the ASCII question mark ? (0x3F). The only plausible explanation is that your example string has passed through a piece of software that tried to interpret its contents according to some other encoding (probably a single-byte character set), and that replaced "invalid" characters with ? (analogously to how a Unicode text processor might replace invalid Unicode characters with U+FFFD).
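
A quick way to see this (a minimal Python sketch, not from your file or its producer) is to compare the two byte sequences position by position; every difference turns out to be a 0x81 or 0x8D that has become 0x3F:

    # Compare the "mystery" bytes with the expected UTF-8 bytes for the same string.
    good = bytes.fromhex("ED959C EAB5AD EAB2BD EBB681 EC9881 EB8D95 EAB5B0 EAB095 EAB5AC EBA9B4")
    bad  = bytes.fromhex("ED959C EAB5AD EAB2BD EBB63F EC983F EB3F95 EAB5B0 EAB095 EAB5AC EBA9B4")

    # Every differing position holds '?' (0x3F) in the damaged data.
    assert all(b == 0x3F for g, b in zip(good, bad) if g != b)

    # The set of original byte values that were clobbered.
    print(sorted({hex(g) for g, b in zip(good, bad) if g != b}))  # ['0x81', '0x8d']
    print(good.decode("utf-8"))  # 한국경북영덕군강구면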

Unfortunately, that process is not really reversible, since at least two distinct byte values (and probably more that don't happen to appear in your example) got replaced, so there's no guaranteed way to identify the original byte value in every case. Depending on how important this is — that is, depending on how much time it's worth spending on it — you could potentially identify the complete set of bytes that got replaced, and then write something that tries each possible value for each byte, comparing the resulting character sequences with (say) bigram frequencies from some corpus of text in the relevant language, and selecting the most probable byte. (Of course, it will make some mistakes. To estimate the resulting error rate, you can try the same process on a known text.)
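
If you do want to attempt that repair, one way to set it up (a rough sketch under my own assumptions; the candidate byte range and the score function are placeholders, not anything taken from your file) is to enumerate substitutions for each '?' slot, discard the ones that don't decode as valid UTF-8, and keep the highest-scoring candidate:

    import itertools

    # Assumption: every clobbered byte was a UTF-8 continuation byte (0x80-0xBF).
    CANDIDATES = bytes(range(0x80, 0xC0))

    def candidate_decodings(damaged: bytes):
        """Yield every string obtainable by substituting a candidate byte for each '?'."""
        slots = [i for i, b in enumerate(damaged) if b == 0x3F]
        for combo in itertools.product(CANDIDATES, repeat=len(slots)):
            trial = bytearray(damaged)
            for pos, value in zip(slots, combo):
                trial[pos] = value
            try:
                yield trial.decode("utf-8")
            except UnicodeDecodeError:
                continue  # not valid UTF-8 with this substitution; discard

    def score(text: str) -> float:
        """Placeholder: return a bigram log-probability from a corpus in the relevant language."""
        return 0.0

    # The three damaged characters from your example.
    damaged = bytes.fromhex("EBB63F EC983F EB3F95")
    best = max(candidate_decodings(damaged), key=score)
    print(best)

Note that this sketch treats every 0x3F byte as damage; a legitimate question mark in the text would need to be distinguished, for example by only treating as damaged those 0x3F bytes that fall where a UTF-8 continuation byte is expected.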

Upvotes: 3
