joaoricardo000

Reputation: 4959

Is there a way to fix wrongly encoded strings?

I am getting this string via a message broker (Stomp):

João

and this is how it is supposed to be:

João

Is there a way to reverse this in Java? Thanks!

Upvotes: 3

Views: 4348

Answers (2)

Peter

Reputation: 3407

In some cases a hack works, but it is best to prevent the problem from ever happening.

I had this problem before with a servlet that sent the correct headers, HTTP content type, and encoding for the page, yet IE would submit forms encoded as latin1 instead of the declared encoding. I wrote a quick and dirty hack (a request wrapper that detects IE and converts the data) to fix it for new data, which worked fine. For the data in the database that was already messed up, I used the following hack.

Unfortunately, my hack doesn't work perfectly for your example string, but it comes very close: your broken string (João) has just one extra character compared to the broken string that my 'theoretical cause' reproduces (João). So perhaps my guess of "latin1" is wrong, and you should try other encodings (such as those in the other link posted by Tomas).

package peter.test;

import java.io.UnsupportedEncodingException;

/**
* User: peter
* Date: 2012-04-12
* Time: 11:02 AM
*/
public class TestEncoding {
    public static void main(String args[]) throws UnsupportedEncodingException {
        //In some cases a hack works. But best is to prevent it from ever happening.
        String good = "João";
        String bad = "João";

        //this line demonstrates what the "broken" string should look like if it is reversible.
        String broken = breakString(good, bad);

        //here we show that it is fixable if broken like breakString() does it.
        fixString(good, broken);

        //this line attempts to fix the string, but it is not fixable unless broken in the same way as breakString()
        fixString(good, bad);
    }

    private static String fixString(String good, String bad) throws UnsupportedEncodingException {
        // Read the Java chars back as latin1 bytes. If this works, the byte count equals
        // the number of Java chars; with UTF8 there would be more bytes than chars.
        byte[] bytes = bad.getBytes("latin1");
        // Take the raw bytes and try to decode them as if they were UTF8.
        String fixed = new String(bytes, "UTF8");

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("fixed: " + fixed);
        System.out.println();

        return fixed;
    }

    private static String breakString(String good, String bad) throws UnsupportedEncodingException {
        byte[] bytes = good.getBytes("UTF8");
        String broken = new String(bytes, "latin1");

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("broken: " + broken);
        System.out.println();

        return broken;
    }
}

And the result (with Sun jdk 1.7.0_03):

Good: João
Bad: João
bytes1.length: 5
broken: João

Good: João
Bad: João
bytes1.length: 5
fixed: João

Good: João
Bad: João
bytes1.length: 6
fixed: Jo�£o
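One possible explanation for the extra character, offered as a guess that nothing in the question confirms: the text may have gone through the same UTF-8-read-as-latin1 mistake twice. The second pass produces U+00C3, the invisible control character U+0083, U+00C2, and U+00A3. If the control character is lost along the way (for example when copying and pasting), the six visible characters of João remain, and the missing byte would also account for the � in the last result above. A minimal sketch, with class and method names made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the answer above.
final class DoubleMojibake {
    // One round of the assumed mistake: UTF-8 bytes decoded as ISO-8859-1 (latin1).
    static String breakOnce(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses one round: turn the chars back into latin1 bytes, decode as UTF-8.
    static String fixOnce(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "Jo\u00E3o"; // "João", escaped so the source file encoding does not matter
        String twice = breakOnce(breakOnce(good));

        // Print the code points so the invisible control character shows up.
        for (char c : twice.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();

        // Two rounds of fixing undo two rounds of breaking.
        System.out.println(fixOnce(fixOnce(twice)).equals(good));
    }
}
```

If the received bytes still contain the control character, applying the fix twice recovers the original. Once that character is gone, charset conversion alone cannot repair the string.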

Upvotes: 2

dfb

Reputation: 13289

U+00C3  Ã   c3 83   LATIN CAPITAL LETTER A WITH TILDE
U+00C2  Â   c3 82   LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A3  £   c2 a3   POUND SIGN
U+00E3  ã   c3 a3   LATIN SMALL LETTER A WITH TILDE
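The byte sequences in this table can be checked with a few lines of Java. This is only a sketch for verification; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking the table above.
final class Utf8Bytes {
    // Returns the UTF-8 encoding of one code point as lowercase hex, bytes separated by spaces.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x00C3, 0x00C2, 0x00A3, 0x00E3};
        for (int cp : codePoints) {
            // Character.getName returns the Unicode character name (Java 7 and later).
            System.out.printf("U+%04X  %s  %s%n", cp, utf8Hex(cp), Character.getName(cp));
        }
    }
}
```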

I'm having trouble determining how this could be a data (encoding) conversion problem. Is it possible the data is just bad?

If the data isn't bad, then we have to assume you are misinterpreting the encoding. We don't know the original encoding, and unless you're doing something different, Java holds strings internally as UTF-16. I don't see how João encoded in any common encoding could end up as João in a UTF-16 Java string.
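To make the distinction concrete: a Java String holds UTF-16 code units in memory, while the number of bytes on the wire depends on the charset used when encoding or decoding. A small sketch (class and method names are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any answer in this thread.
final class CharsetSizes {
    // Number of bytes the string occupies when encoded with the given charset.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String good = "Jo\u00E3o"; // "João"

        // The platform default charset is used whenever no charset is passed explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println("chars:      " + good.length());
        System.out.println("UTF-8:      " + byteCount(good, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + byteCount(good, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-16BE:   " + byteCount(good, StandardCharsets.UTF_16BE));
    }
}
```

Passing the charset explicitly on both the sending and the receiving side avoids depending on the platform default.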

Just to be sure, I whipped up this Python script, which found no match. I'm not entirely sure that it covers all encodings or that I'm not missing a corner case, FWIW.

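#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import encodings
import pkgutil
import sys

good = 'João'
bad = 'João'

# "aliases" is a helper module in the encodings package, not a codec.
false_positives = {"aliases"}

# Collect the name of every codec module shipped in the encodings package.
found = {name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg}
found.difference_update(false_positives)
print(found)

# Encode the good string with codec x, decode the bytes with codec y,
# and stop as soon as a pair reproduces the bad string.
for x in found:
    for y in found:
        res = None
        try:
            res = good.encode(x).decode(y)
            print(res, x, y)
        except Exception:
            # Most pairs cannot encode or decode this text; skip them.
            pass
        if res is not None and res == bad:
            print("FOUND")
            sys.exit(1)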
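Since the question asks about Java, the search can also be run in the other direction: start from the bad string and test whether a candidate charset turns it back into valid UTF-8. The sketch below uses strict encoders and decoders so that an impossible repair is reported instead of being papered over with replacement characters. The class name, the method name, and the list of candidate charsets are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a sketch, not a general-purpose repair tool.
final class StrictRepair {
    // Returns the repaired string, or null if `bad` cannot be the result of
    // decoding valid UTF-8 bytes with `wrongCharset`.
    static String tryFix(String bad, Charset wrongCharset) {
        try {
            // Recover the raw bytes; fail if a char does not exist in wrongCharset.
            ByteBuffer bytes = wrongCharset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(bad));
            // Decode as UTF-8; fail on any malformed byte sequence.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Candidate charsets are a guess; extend the list as needed.
        String[] candidates = {"ISO-8859-1", "windows-1252", "ISO-8859-15"};
        // The reversible form, then the six visible characters from the question.
        String[] inputs = {"Jo\u00C3\u00A3o", "Jo\u00C3\u00C2\u00A3o"};
        for (String input : inputs) {
            for (String name : candidates) {
                if (!Charset.isSupported(name)) {
                    continue; // skip charsets this JVM does not provide
                }
                System.out.println(input + " via " + name + " -> "
                        + tryFix(input, Charset.forName(name)));
            }
        }
    }
}
```

A null result means the string was not broken in that particular way, so a different charset or a different cause has to be considered.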

Upvotes: 4
