Is there a way to fix wrongly encoded strings?

Question

I am getting this string via a message broker (Stomp):

JoÃÂ£o

and this is how it is supposed to be:

João

Is there a way to revert this in Java?! Thanks!

dfb · Accepted Answer

Code point  Char  UTF-8   Name
U+00C3      Ã     c3 83   LATIN CAPITAL LETTER A WITH TILDE
U+00C2      Â     c3 82   LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A3      £     c2 a3   POUND SIGN
U+00E3      ã     c3 a3   LATIN SMALL LETTER A WITH TILDE
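
As a quick check, the UTF-8 byte values listed above can be reproduced with Python 3's built-in `str.encode`. This is only a small sketch to verify the table; the helper name `utf8_hex` is hypothetical.

```python
import unicodedata


def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as space-separated hex bytes."""
    return " ".join("%02x" % b for b in ch.encode("utf-8"))


# The four characters from the table, written as escapes to avoid any
# dependence on the source file's own encoding.
for ch in "\u00c3\u00c2\u00a3\u00e3":
    print("U+%04X  %s  %s  %s" % (ord(ch), ch, utf8_hex(ch), unicodedata.name(ch)))
```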

I'm having trouble determining how this could be a data (encoding) conversion problem. Is it possible the data is just bad?

If the data isn't bad, then we have to assume you are misinterpreting the encoding. We don't know the original encoding, and unless you're doing something different, Java holds strings internally as UTF-16. I don't see how João encoded in any common encoding could be interpreted as JoÃÂ£o in UTF-16.
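
To illustrate the kind of single misinterpretation being considered here, this Python 3 sketch encodes the intended text as UTF-8 and decodes the bytes as Latin-1. That pair of encodings is an assumption picked for illustration, since the original encoding is unknown, and `misread` is a hypothetical helper.

```python
def misread(text, written_as, read_as):
    """Encode text with one codec, then decode the resulting bytes with another."""
    return text.encode(written_as).decode(read_as)


good = "Jo\u00e3o"             # João
bad = "Jo\u00c3\u00c2\u00a3o"  # JoÃÂ£o, as it appears in the question

once = misread(good, "utf-8", "latin-1")
print(once)         # JoÃ£o
print(once == bad)  # False
```

The result has one Ã and no Â, so this particular single mix-up does not reproduce the reported string.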

Just to be sure, I whipped up this Python script, and it found no match. FWIW, I'm not entirely sure it covers all encodings or that I'm not missing a corner case.

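#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Brute-force search: encode the good string with codec x, decode it with codec y.
import encodings
import pkgutil
import sys

good = 'João'
bad = 'JoÃÂ£o'

# "aliases" is a helper module in the encodings package, not a codec.
false_positives = {"aliases"}

# Collect every codec module shipped in the standard encodings package.
found = {name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg}
found.difference_update(false_positives)
print(found)

for x in found:
    for y in found:
        res = None
        try:
            res = good.encode(x).decode(y)
            print(res, x, y)
        except Exception:
            pass  # this codec pair cannot handle the text or the bytes; skip it
        if res == bad:
            print("FOUND")
            sys.exit(1)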
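
One corner case the search above does not cover is a chain of two misinterpretations, or a bad string containing a character that did not survive copy and paste. The Python 3 sketch below is only a hypothesis check, not a diagnosis. It assumes the text was written as UTF-8 and read as Latin-1 twice in a row, and that an invisible control character (U+0083) was lost when the string was pasted. Both assumptions are mine, and the helper names are hypothetical.

```python
def misread(text, written_as="utf-8", read_as="latin-1"):
    """Encode text with one codec, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


def repair(text, rounds=2):
    """Undo `rounds` passes of UTF-8-read-as-Latin-1; raises if the text does not fit."""
    for _ in range(rounds):
        text = text.encode("latin-1").decode("utf-8")
    return text


good = "Jo\u00e3o"  # João
twice = misread(misread(good))

print([hex(ord(c)) for c in twice])
# ['0x4a', '0x6f', '0xc3', '0x83', '0xc2', '0xa3', '0x6f']

# Dropping the invisible U+0083 leaves the string shown in the question.
visible = twice.replace("\u0083", "")
print(visible == "Jo\u00c3\u00c2\u00a3o")  # True

print(repair(twice) == good)  # True
```

If the string received from the broker still contains U+0083, the same two steps can be expressed in Java with `String.getBytes(Charset)` and `new String(byte[], Charset)`. If that character is really absent, the bytes are no longer valid UTF-8 and this repair does not apply.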
