halem

Reputation: 69

java convert String windows-1251 to utf8

Scanner sc = new Scanner(System.in);
System.out.println("Enter text: ");
String text = sc.nextLine();
try {
    String result = new String(text.getBytes("windows-1251"), Charset.forName("UTF-8"));
    System.out.println(result);
} catch (UnsupportedEncodingException e) {
    System.out.println(e);
}

I'm trying to convert between keyboard layouts: input on a Cyrillic keyboard, output in Latin. Example: qwerty -> йцукен

It doesn't work. Can anyone tell me what I'm doing wrong?

Upvotes: 6

Views: 35744

Answers (2)

Joop Eggen

Reputation: 109613

First, Java text (String/char/Reader/Writer) is internally Unicode, so it can combine all scripts. This is a major difference from, for instance, C/C++, where there is no such standard.

Now System.in is an InputStream for historical reasons. It needs an indication of the encoding used.

Scanner sc = new Scanner(System.in, "Windows-1251");

The above explicitly sets the conversion for System.in to Cyrillic. Without this optional parameter, the default encoding is used. If the software has not changed it, that is the platform encoding, so the original code might have been correct too.
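As a self-contained sketch of what that constructor does (a ByteArrayInputStream stands in for System.in here, and the sample string is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

// Simulate System.in delivering Windows-1251 bytes for "йцукен".
byte[] cp1251 = "йцукен".getBytes(Charset.forName("windows-1251"));
Scanner sc = new Scanner(new ByteArrayInputStream(cp1251), "windows-1251");
String text = sc.nextLine(); // decoded back to proper Unicode
```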

Now text is correct, containing the Cyrillic from System.in as Unicode.

You would get the UTF-8 bytes as:

byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

The old "recoding" of text was wrong; drop that line. In fact, not all Windows-1251 byte sequences are valid UTF-8 multi-byte sequences.

String result = text;

System.out.println(result);
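A small sketch of why that re-interpretation fails (the sample string is hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String cyr = "йцукен";
byte[] cp1251 = cyr.getBytes(Charset.forName("windows-1251")); // one byte per Cyrillic letter
String broken = new String(cp1251, StandardCharsets.UTF_8);    // these bytes are not valid UTF-8
// broken now contains U+FFFD replacement characters instead of the original text
```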

System.out is a PrintStream, a rather rarely used historic class. It prints using the default platform encoding. Usually you can rely on the default encoding being correct.

System.out.println(result);
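If the platform default cannot be relied on, a PrintStream can be given an explicit charset (the Charset overload exists since Java 10; a ByteArrayOutputStream stands in for the console in this sketch):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

ByteArrayOutputStream buf = new ByteArrayOutputStream();
PrintStream out = new PrintStream(buf, true, StandardCharsets.UTF_8);
out.println("йцукен"); // encoded as UTF-8 regardless of the platform default
String written = new String(buf.toByteArray(), StandardCharsets.UTF_8);
```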

For printing to an UTF-8 encoded file:

byte[] bytes = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
Path path = Paths.get("C:/Temp/test.txt");
Files.write(path, bytes);

Here I have added a Unicode BOM character in front so Windows Notepad can recognize the encoding as UTF-8. In general one should avoid using a BOM: it is a zero-width space (i.e. invisible) and plays havoc with all kinds of formats: CSV, XML, file concatenation, cut-copy-paste.
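When reading such a file back, the BOM has to be stripped again; a sketch using a temporary file instead of the fixed path:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

Path path = Files.createTempFile("test", ".txt");
Files.write(path, ("\uFEFF" + "йцукен").getBytes(StandardCharsets.UTF_8));
String content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
if (content.startsWith("\uFEFF")) {
    content = content.substring(1); // drop the invisible BOM
}
Files.delete(path);
```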

Upvotes: 7

v010dya

Reputation: 5858

The reason you got an answer to a different question, and nobody answered yours, is that your title doesn't match the question. You were not trying to convert between charsets, but between keyboard layouts.

Here you shouldn't worry about character encoding at all: simply read the line, convert it to an array of characters, go through them, and convert each one using a predefined map.

The code will be something like this:

Map<Character, Character> table = new TreeMap<>(); // generics need Character, not the primitive char
table.put('q', 'й');
table.put('Q', 'Й');
table.put('w', 'ц');
// .... etc

String text = sc.nextLine();
char[] cArr = text.toCharArray();
for(int i=0; i<cArr.length; ++i)
{
  if(table.containsKey(cArr[i]))
  {
    cArr[i] = table.get(cArr[i]);
  }
}
text = new String(cArr);
System.out.println(text);

Now, I don't have time to test that code, but you should get the idea of how to do your task.
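A minimal self-contained sketch of the same idea (only three mappings shown; a full table would cover the whole layout):

```java
import java.util.Map;
import java.util.TreeMap;

Map<Character, Character> table = new TreeMap<>();
table.put('q', 'й');
table.put('w', 'ц');
table.put('e', 'у');

String input = "qwe";
StringBuilder sb = new StringBuilder();
for (char c : input.toCharArray()) {
    sb.append(table.getOrDefault(c, c)); // unmapped characters pass through unchanged
}
String converted = sb.toString();
```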

Upvotes: 1
