Reputation: 3661
What "stack" of bad encoding would produce the following bytes of weirdness for the string "cinéma télédiffusion"? (I left out the space character, hex: 20)
cinÃ%ma
in HEX: 63 69 6E C3 83 25 6D 61
mapped: c i n ---é---- m a
tÃclÃcdiffusion
in HEX: 74 C3 83 63 6C C3 83 63 64 69 66 66 75 73 69 6F 6E
mapped: t ---é---- l ---é---- d i f f u s i o n
The ---é---- parts mark the bytes that aren't right.
I considered the idea "What if it was a botched transcoding? How about a double encoding?", but, looking at http://www.fileformat.info/info/unicode/char/00e9/charset_support.htm (and the code page edition, too), I noted that there are no encodings in which é could possibly end with the hex byte 25 or 63. It doesn't even look like double UTF-8 encoding at this point, because http://en.wikipedia.org/wiki/UTF-8 clarified that the byte following a C3 lead byte would need to have its top two bits set to 10 (10xxxxxx).
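To make the byte-level reasoning concrete, here is a minimal Python sketch that classifies each byte of the first mangled string by its UTF-8 role. The helper name `classify` is made up for illustration. It suggests that C3 83 is a well-formed two-byte sequence on its own (it decodes to Ã), and that the 25 after it is a separate ASCII byte rather than a broken continuation byte.

```python
def classify(byte: int) -> str:
    """Return the UTF-8 role of a single byte (hypothetical helper)."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx
    if byte & 0xE0 == 0xC0:
        return "lead-2"         # 110xxxxx, starts a 2-byte sequence
    if byte & 0xF0 == 0xE0:
        return "lead-3"         # 1110xxxx, starts a 3-byte sequence
    if byte & 0xF8 == 0xF0:
        return "lead-4"         # 11110xxx, starts a 4-byte sequence
    return "invalid"

# Bytes quoted in the question for the mangled "cinéma"
mangled = bytes.fromhex("63 69 6E C3 83 25 6D 61")

kinds = [(f"{b:02X}", classify(b)) for b in mangled]
for hex_byte, kind in kinds:
    print(hex_byte, kind)

# The whole byte string is valid UTF-8; it just decodes to the wrong text.
decoded = mangled.decode("utf-8")
print(decoded)
```

Read this way, the data is valid UTF-8 that encodes the wrong characters, which points at a lossy step somewhere before the final encoding rather than at a damaged byte stream.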
How could some program have turned the accented é into an "Ã followed by %" in one place and an "Ã followed by c" in another? I want to trace back the history of the misencoding so that I can try to come up with something that takes steps toward repairing the mangled strings.
There is also the possibility that the é characters were never é to begin with, but I can't fathom what kind of typo someone could have made in the same phrase to get two different versions of é that eventually get misencoded into two completely different sets of bytes.
Extra context details: I find these mangled strings inside of an XML file. The file has no <?xml version="1.0"?> header, so it's presumed to be UTF-8. There are nodes containing phrases with perfectly good é characters alongside nodes containing phrases with mangled é characters.
iconv and family don't do anything at all to help this situation, as far as I've tried.
A couple of trailing considerations that I now hold are: Should I suspect MySQL and its infamously lazy character set transcodings? Could it be somebody's really badly written custom encoding function as they exported the XML?
Upvotes: 5
Views: 1751
Reputation: 3821
The encoding looks a bit strange:
Taking the é from cinéma, its UTF-8 encoding is:
é = C3 A9
where you got:
C3 83 25
If it were double encoded (the UTF-8 bytes misread as single-byte characters and then encoded as UTF-8 again), the following should happen:
c3: Ã -> c3 83
a9: © -> c2 a9
But this does not explain the 25 in your result:
25: %
So the question is whether the text was encoded once, then unknown characters like © were replaced by %, and then it was encoded a second time.
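The two steps can be sketched in Python. This is only an illustration under assumptions: Latin-1 stands in for whatever single-byte charset the misreading program used, and the "replace © with %" step is the guess from this answer, not a known behaviour of any particular tool.

```python
original = "é"

# Step 1: correct UTF-8 encoding of é
once = original.encode("utf-8")
print(once.hex(" "))        # c3 a9

# Step 2: the bytes are misread as a single-byte charset (Latin-1 assumed)
misread = once.decode("latin-1")   # two characters: Ã and ©

# Plain double encoding: both characters get re-encoded as UTF-8
twice = misread.encode("utf-8")
print(twice.hex(" "))       # c3 83 c2 a9

# Hypothetical lossy step: the second character is swapped for '%'
# before the second encoding pass (assumption, see lead-in)
lossy = misread[0] + "%"
observed = lossy.encode("utf-8")
print(observed.hex(" "))    # c3 83 25
```

Under these assumptions the result matches the C3 83 25 from the question. The same lossy step with c instead of % would give the C3 83 63 seen in the second string, though that does not explain why two different replacement characters appear.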
Upvotes: 1