Steve Bennett

Reputation: 126687

How to diagnose, and reverse (not prevent) Unicode mangling

Somewhere upstream of me, "something" happened that looks like Unicode mangling. One symptom is that a lowercase u umlaut (ü) gets converted to "Ã¼" (i.e., byte FC gets converted to bytes C3 BC). Assuming that I have no control over this upstream process, how can I reverse-engineer what's going on? And if that is possible, can I crank the sausage machine backwards and get the original text back?

(If it helps to understand this case, the text I received was in the form of a MySQL dump. I think somewhere in the dump/transport process it got mangled.)

Upvotes: 2

Views: 831

Answers (2)

Kilian Foth

Reputation: 14386

Your text isn't 'mangled'. It's just in UTF-8. C3 BC is what the ü is supposed to be encoded as. Just set whatever software you use to UTF-8 as well, and all pain will go away. If you can't set your software to Unicode, seriously consider switching to newer software.

I know it's scary at first, but you will have to do that eventually, anyway. My favorite music typesetter switched to Unicode-only input a while ago (they even deliberately removed support for the old 8-bit code pages to get people to switch), and I was upset, thinking that Latin-1 was good enough for me, and it was stupid to break stuff that was working perfectly well... and then I got over it and just set emacs to Unicode buffers, and now I'll never have to think about character encoding again in my life!

Upvotes: 4

mkluwe

Reputation: 4061

First of all, it looks like you've got UTF-8 encoded text (you're seeing "Ã¼" because the UTF-8 bytes for ü are being interpreted in your expected encoding, maybe Latin-1).

You can check this guess by verifying that the text contains only valid UTF-8 byte sequences and none of the invalid ones. See the Wikipedia article on UTF-8 for a reference listing valid and invalid byte sequences. You can be pretty sure about the encoding if the text starts with a BOM, but a BOM is not required for UTF-8.
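As a rough sketch of that check, Python's strict UTF-8 decoder can do the byte-sequence validation for you. The function names and sample bytes here are made up for illustration; they are not from any particular tool.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is valid UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the (optional) UTF-8 BOM, EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical samples: the same word encoded as UTF-8 and as Latin-1.
sample_utf8 = b"M\xc3\xbcnchen"    # ü as C3 BC
sample_latin1 = b"M\xfcnchen"      # ü as FC, which is not valid UTF-8

print(looks_like_utf8(sample_utf8))
print(looks_like_utf8(sample_latin1))
print(has_utf8_bom(sample_utf8))
```

Keep in mind that pure ASCII text also passes this check, so a positive result only tells you something when the data contains bytes above 7F.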

To get the text back in your required encoding, several tools are available, GNU recode for one.
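If you'd rather script it than use recode, the same conversion might look like this in Python. This is only a sketch: it assumes the target encoding really is Latin-1 (it could just as well be something like Windows-1252), and both function names are invented for this example.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Latin-1 (assumed target encoding).

    Raises UnicodeEncodeError if the text contains characters
    that Latin-1 cannot represent.
    """
    text = data.decode("utf-8")
    return text.encode("latin-1")


def fix_mojibake(text: str) -> str:
    """Undo the case where UTF-8 bytes were decoded as Latin-1.

    Turns an already-decoded string such as "Ã¼" back into "ü" by
    recovering the original bytes and decoding them as UTF-8.
    """
    return text.encode("latin-1").decode("utf-8")


# Bytes C3 BC (UTF-8 for ü) become the single byte FC in Latin-1.
print(utf8_to_latin1(b"M\xc3\xbcnchen"))

# A string that was wrongly decoded as Latin-1 gets repaired.
print(fix_mojibake("M\u00c3\u00bcnchen"))
```

Both functions raise an exception on input that doesn't fit the assumption, which is usually what you want: it tells you the guess about the encodings was wrong rather than silently producing more garbage.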

Upvotes: 2
