starlocke

Reputation: 3661

What can explain this bad character-encoding?

What "stack" of bad encoding would produce the following bytes of weirdness for the string "cinéma télédiffusion"? (I left out the space character, hex: 20)

cinÃ%ma
in HEX: 63 69 6E C3 83 25 6D 61
mapped: c  i  n  ---é----  m  a

tÃclÃcdiffusion
in HEX: 74 C3 83 63 6C C3 83 63 64 69 66 66 75 73 69 6F 6E
mapped: t  ---é---- l  ---é---- d  i  f  f  u  s  i  o  n

The ---é---- parts represent the bytes that aren't right.
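
To make the dumps easier to inspect, here is a minimal Python sketch that loads the hex bytes listed above and decodes them strictly as UTF-8. The variable names are just for illustration. Both dumps decode without error, so they are well-formed UTF-8 that happens to spell the wrong characters.

```python
# Hex dumps copied from the question (the space character is left out there too).
cinema = bytes.fromhex("63 69 6E C3 83 25 6D 61")
tele = bytes.fromhex("74 C3 83 63 6C C3 83 63 64 69 66 66 75 73 69 6F 6E")

# Strict decoding raises UnicodeDecodeError on malformed UTF-8.
cinema_text = cinema.decode("utf-8")
tele_text = tele.decode("utf-8")

# Print code points instead of glyphs so the output does not depend on the console.
print([f"U+{ord(ch):04X}" for ch in cinema_text])
print([f"U+{ord(ch):04X}" for ch in tele_text])
```

In both strings the bytes C3 83 come out as U+00C3 (Ã), followed by an ordinary ASCII character.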

I considered the idea "What if it was a messed up transcoding? How about a double encoding?", but, looking at http://www.fileformat.info/info/unicode/char/00e9/charset_support.htm (and the code page edition, too), I noted that there are no encodings that could possibly encode é with a final byte of %25 or %63. It doesn't even look like double UTF-8 encoding at this point, because http://en.wikipedia.org/wiki/UTF-8 clarified that bytes following a %C3 would need to have their leading bits set to 10 (10xxxxxx).
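
The bit-pattern check can be sketched in a few lines of Python. The helper name is_continuation is made up for this illustration. It tests the three bytes that follow C3 in the dumps.

```python
def is_continuation(byte: int) -> bool:
    """Return True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

# 0x83 follows C3 in both strings; 0x25 and 0x63 come right after it.
for b in (0x83, 0x25, 0x63):
    print(f"{b:02X} = {b:08b} continuation={is_continuation(b)}")
```

Only 0x83 matches 10xxxxxx, so C3 83 is a complete two-byte sequence on its own, and 0x25 and 0x63 are standalone ASCII bytes (% and c) rather than part of a multibyte character.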

How could some program have turned the accented é into an "Ã followed by %" as well as "Ã followed by c"? I want to trace back the history of the misencoding so that I can come up with something that takes steps toward repairing the mangled strings.

There is also the possibility that the é characters were never é to begin with, but I can't fathom what kind of typo someone could have made in the same phrase to get two different versions of é that eventually got misencoded into two completely different sets of bytes.

Extra context details: I find these mangled strings inside of an XML file. The file has no <?xml version="1.0"?> header, so it's presumed to be UTF-8. Some nodes contain phrases with perfectly good é characters, while other nodes in the same file contain phrases with mangled é characters.

iconv-and-family don't do anything at all to help this situation, as far as I've attempted.

A couple of trailing considerations that I now hold are: Should I suspect MySQL and its infamously lazy character set transcodings? Could it be somebody's really badly written custom encoding function as they exported the XML?

Upvotes: 5

Views: 1751

Answers (1)

PowerStat

Reputation: 3821

The encoding looks a bit strange:

Taking the é from cinéma, its UTF-8 encoding is:

é = C3 A9

where you got:

C3 83 25

So if it were double-encoded (each UTF-8 byte misread as a single-byte character and then encoded as UTF-8 again), the following should happen:

c3: Ã -> c3 83

a9: © -> c2 a9

But this does not explain the 25 in your result.

25: %
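
The double-encoding step can be checked with a short Python sketch. It assumes the misread charset is Latin-1 (ISO-8859-1), which is what the Ã and © above imply. That is an assumption, since the actual charset involved is unknown.

```python
e_acute = "\u00e9"                    # é
once = e_acute.encode("utf-8")        # first, correct encoding

# Assumption: each byte is misread as a Latin-1 character, then encoded again.
twice = once.decode("latin-1").encode("utf-8")

observed = bytes.fromhex("C3 83 25")  # bytes reported in the question

print(once.hex())      # hex of the single encoding
print(twice.hex())     # hex of the double encoding
print(observed.hex())  # hex of what was actually found
```

The double encoding and the observed bytes agree on the first two bytes (C3 83) and differ afterwards: plain double encoding gives C2 A9 where the file has 25.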

So the question is: was this encoded once, then had unknown characters like © replaced by %, and then encoded a second time?
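
That hypothesis can be simulated with a minimal Python sketch. Everything here is an assumption for illustration: the function name lossy_double_encode, the use of Latin-1 for the misreading step, and the rule that "unknown" means any non-ASCII character that is not a letter. It only shows that such a pipeline reproduces the bytes in the question. It does not identify the program that did it.

```python
def lossy_double_encode(text: str, placeholder: str) -> bytes:
    """Hypothetical pipeline: encode as UTF-8, misread the bytes as Latin-1,
    replace non-ASCII non-letters with a placeholder, encode as UTF-8 again."""
    misread = text.encode("utf-8").decode("latin-1")
    patched = "".join(
        # Assumed rule: keep ASCII and letters (so Ã survives), replace the rest.
        ch if ord(ch) < 128 or ch.isalpha() else placeholder
        for ch in misread
    )
    return patched.encode("utf-8")

cinema = lossy_double_encode("cin\u00e9ma", "%")
tele = lossy_double_encode("t\u00e9l\u00e9diffusion", "c")

print(cinema.hex())
print(tele.hex())
```

With "%" as the placeholder this yields the bytes reported for cinéma, and with "c" it yields the bytes reported for télédiffusion. Why the placeholder would differ between the two words is still unexplained.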

Upvotes: 1
