Alex R

Reputation: 11881

Unicode issue: How to convert â€™ to ’ in the response from HttpClient?

The String s and byte[] b in the code below contain different representations of roughly the same thing.

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

import org.testng.Assert;
import org.testng.annotations.Test;

public class Utf8Test {

    @Test
    public void test() throws UnsupportedEncodingException {
        String s = "â€™"; // mojibake: the UTF-8 bytes of ’ read as windows-1252
        byte[] b = new byte[] { (byte) 0xE2, (byte) 0x80, (byte) 0x99 };

        System.out.println(s); // prints â€™

        String t = new String(b, Charset.forName("UTF-8"));
        System.out.println(t); // prints ’

        String u = new String(s.getBytes("ISO-8859-1"), Charset.forName("UTF-8"));
        System.out.println(u); // prints ???

        byte[] b2 = new byte[s.length()];
        for(int i=0; i < s.length(); ++i) {
            b2[i] = (byte) (s.charAt(i) & 0xFF);
        }
        String v = new String(b2, Charset.forName("UTF-8"));
        System.out.println(v); // prints ?"

        Assert.assertEquals(s,v); // FAIL
    }

}

How can I convert String s to the same value as String t?

I have already tried the code resulting in String u and String v, and the result is indicated in the comments.

XY Problem

This is actually an XY problem. The String s is being returned in the HttpEntity of an HttpClient call. All I want is the properly decoded response. The above is far easier to reproduce than a whole HTTP stack, so let's solve that instead.

Upvotes: 0

Views: 1586

Answers (1)

Alex R

Reputation: 11881

This seems to work, but I don't understand why, and I worry it may be platform-dependent:

byte[] d = s.getBytes("cp1252"); 
String w = new String(d, Charset.forName("UTF-8"));
System.out.println(w); // prints ’
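
A possible explanation, sketched under the assumption that the response bytes were UTF-8 but were decoded as windows-1252 (cp1252) somewhere upstream: the bytes 0xE2 0x80 0x99 then become the three characters â, € and ™, and encoding those back with the same charset restores the original bytes, which UTF-8 decodes to ’. ISO-8859-1 cannot do this because it has no € or ™. The class and method names below are hypothetical and only illustrate the round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Hypothetical helper (not from the original post): undoes a wrong
    // windows-1252 decode of bytes that were really UTF-8.
    static String repair(String garbled) {
        // Naming the charset explicitly avoids relying on the platform default.
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] original = garbled.getBytes(cp1252); // back to the raw bytes
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+2019 (right single quotation mark).
        byte[] wire = { (byte) 0xE2, (byte) 0x80, (byte) 0x99 };

        // Simulate the assumed upstream mistake: decode as windows-1252.
        String garbled = new String(wire, Charset.forName("windows-1252"));
        System.out.println(garbled.equals("\u00E2\u20AC\u2122")); // true

        String fixed = repair(garbled);
        System.out.println(fixed.equals("\u2019")); // true
    }
}
```

Because the charset is named explicitly, the result should not depend on the platform default encoding, although windows-1252 is not among the charsets that every Java implementation is required to support. The round trip may also be lossy for bytes that windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so decoding the raw response bytes as UTF-8 in the first place is probably the more robust fix.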

Upvotes: 1
