How did SourceForge maim this Unicode character?

Question

A little encoding puzzle for you.

A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.

In the XML export, however, it shows up as:

&#226;&#8364;&#8221;

Decoding the entities, that results in these code points:

U+00E2 U+20AC U+201D

I.e. the characters â€”. The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.

Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?

BalusC · Accepted Answer

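The XML output has been incorrectly encoded: the UTF-8 bytes of the em dash (0xE2 0x80 0x94) were read as CP1252, which turns them into the three characters â, € and ”. To revert this, convert â€” to bytes using CP1252 encoding and then convert those bytes back to a string using UTF-8 encoding.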

Java based evidence:

String s = "\u00E2\u20AC\u201D"; // â€” written as escapes, so the source file's encoding cannot alter it
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —

Note that this assumes the stdout console itself uses UTF-8 to display the character.
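To check the explanation without depending on the console encoding, the whole round trip can be sketched as below. This is only an illustration of the presumed bug, not SourceForge's actual exporter code: the class name MojibakeDemo and the helpers mangle, repair, toEntities and fromEntities are hypothetical names made up for this example. It prints the entity string and the hex code point instead of the character itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MojibakeDemo {
    // "windows-1252" is the canonical Java charset name for CP1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Forward direction (the presumed bug): UTF-8 bytes misread as CP1252.
    static String mangle(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Reverse direction: recover the original bytes, then decode them as UTF-8.
    static String repair(String mangled) {
        return new String(mangled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    // Write every non-ASCII code point as a decimal numeric entity.
    static String toEntities(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    // Replace decimal numeric entities with the characters they denote.
    static String fromEntities(String xml) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String ch = new String(Character.toChars(Integer.parseInt(m.group(1))));
            m.appendReplacement(sb, Matcher.quoteReplacement(ch));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2014 EM DASH, mangled the way the export appears to do it.
        String exported = toEntities(mangle("\u2014"));
        System.out.println(exported); // &#226;&#8364;&#8221;

        // Undo it: decode the entities, then reverse the charset mix-up.
        String fixed = repair(fromEntities(exported));
        System.out.println(Integer.toHexString(fixed.codePointAt(0))); // 2014
    }
}
```

One caveat with this approach: characters whose UTF-8 encoding contains a byte that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) may not survive the round trip, so the repair is only reliable for text that was mangled without such bytes.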
