Angel

Reputation: 671

Converting string from UTF-8 to ANSI and displaying it as UTF-8

I want to reproduce in Java something I can do with Notepad++.

TEXT_2 = convert(TEXT_1) // where: TEXT_2 = "Български", TEXT_1 = "БългарÑки"

How to do it with Notepad++

Setting the starting point...

Open Notepad++ and click: Encoding / Encode in UTF-8, then paste TEXT_1:

БългарÑки

Getting TEXT_2

Click: Encoding / Convert to ANSI, then click: Encoding / Encode in UTF-8. Done.

How to do it with Java

So far I have the following function (which works partially):

public static String convert(String text) {
    String output = new String(Charset.forName("Cp1252").encode(text).array(), Charset.forName("UTF8"));
    return output;
}
System.out.println(convert("БългарÑки"));

With this function I get:

Българ�?ки // whereas the correct result is: Български

Any idea how to make it work?

If possible, could you provide code that would work inside the function convert()? Thanks.

Upvotes: 0

Views: 9342

Answers (2)

erickson

Reputation: 269627

There is information lost in "БългарÑ_ки"; there should be another character at "_", but Cp1252 does not map any character to the byte value 0x81. That byte comes from encoding "с" as the UTF-8 byte sequence 0xD1 0x81.
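To make the lost byte concrete, here is a minimal sketch that encodes "с" as UTF-8, mis-decodes the bytes as Cp1252, and then tries to reverse the mistake the way the question's convert() does. The class and method names (LostByteDemo, hex, roundTrip) are made up for illustration, and the behaviour described is that of Java's built-in windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LostByteDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Mis-decodes UTF-8 bytes as Cp1252, then tries to undo the mistake. */
    static String roundTrip(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String mojibake = new String(utf8, CP1252);   // the corruption step
        byte[] back = mojibake.getBytes(CP1252);      // the attempted repair
        return new String(back, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source-file encoding.
        String es = "\u0441"; // Cyrillic "с"
        String i = "\u0438";  // Cyrillic "и"
        System.out.println(hex(es.getBytes(StandardCharsets.UTF_8)));
        System.out.println("es survives: " + roundTrip(es).equals(es));
        System.out.println("i survives:  " + roundTrip(i).equals(i));
    }
}
```

The letter "и" survives because both of its bytes (0xD0 0xB8) have Cp1252 characters, while "с" does not because 0x81 has none, which is consistent with the "�?" in the question's output.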

It's possible that when you copy corrupted text directly from the source, an unprintable control code (the C1 code "HOP", High Octet Preset) is included in the clipboard data, and Notepad++ gets the complete information. But this control character is probably getting lost when copied to other contexts like your Java IDE and this forum.
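If the control character does survive in your String, something like the following could serve as the body of convert(). This is only a sketch under that assumption: it treats any C1 control (U+0080 to U+009F) as the raw byte with the same value and encodes everything else with Cp1252. The class name MojibakeRepair is hypothetical, and the method cannot recover a character that was already dropped before the text reached Java.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Reverses "UTF-8 bytes shown as Cp1252 text".
     * Assumption: bytes without a Cp1252 character (such as 0x81)
     * are still present in the input as C1 control characters.
     */
    static String convert(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // C1 control: assume it stands for the raw byte of the same value.
                bytes.write(c);
            } else if (encoder.canEncode(c)) {
                byte[] b = String.valueOf(c).getBytes(CP1252);
                bytes.write(b, 0, b.length);
            } else {
                throw new IllegalArgumentException(
                        "Not representable in Cp1252 at index " + i);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00D1 followed by the C1 control U+0081 is the mojibake form of "с".
        System.out.println(convert("\u00D1\u0081").equals("\u0441"));
    }
}
```

If the input has already lost the control character, as in the text pasted into the question, no function of this kind can restore it.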

The original data need to be decoded as UTF-8, rather than erroneously converted to text under CP-1252, stripped of controls, and decoded again as UTF-8. When copy-and-pasting, where do you copy from? Why not read that file with UTF-8 instead of CP-1252?
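For instance, if the text comes from a file, decoding the bytes as UTF-8 in the first place avoids the problem entirely. A minimal sketch, where the class name, the helper readUtf8 and the temporary file are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadAsUtf8 {
    /** Reads the whole file and decodes its bytes as UTF-8. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "Български" written with Unicode escapes so the source encoding does not matter.
        String expected =
                "\u0411\u044A\u043B\u0433\u0430\u0440\u0441\u043A\u0438";
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        try {
            Files.write(tmp, expected.getBytes(StandardCharsets.UTF_8));
            System.out.println(readUtf8(tmp).equals(expected));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```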

Upvotes: 1

Sandip Solanki

Reputation: 743

Here's a solution that avoids performing the Charset lookup for every conversion:

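import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Resolved once and shared, so no lookup by name happens per conversion.
private static final Charset UTF8_CHARSET = StandardCharsets.UTF_8;

// Decodes UTF-8 bytes into a String.
String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}

// Encodes a String as UTF-8 bytes.
byte[] encodeUTF8(String string) {
    return string.getBytes(UTF8_CHARSET);
}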

Second approach:

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes("UTF-8");

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, "US-ASCII");

You should, of course, use the encoding that matches your data. My examples used "US-ASCII" and "UTF-8", two commonly used encodings.
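As a variation on the snippets above, the constants in java.nio.charset.StandardCharsets (Java 7 and later) can replace the encoding names. That avoids the checked UnsupportedEncodingException thrown by the String-name overloads and rules out typos in the name. A small sketch, with a made-up class name and helper methods:

```java
import java.nio.charset.StandardCharsets;

class CharsetConstantsDemo {
    /** Encodes a String as UTF-8 bytes. */
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes US-ASCII bytes into a String. */
    static String fromAscii(byte[] b) {
        return new String(b, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 99, (byte) 97, (byte) 116};
        System.out.println(fromAscii(b)); // prints "cat"
        System.out.println(toUtf8("some text here").length);
    }
}
```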

Upvotes: 0
