JohnK

Reputation: 7337

How to read Unicode Greek from the keyboard?

I'm trying to write a Greek vocabulary quiz program. The problem is that I can't get it to interpret the input characters properly. Below is some sample code I put together to demonstrate the problem. (If you don't want to go through the trouble of setting up Greek input for your machine, when the program asks for the word, you can just copy and paste the Greek string. In case it's significant, I'm running this through Eclipse on 64-bit Win7.)

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GreekKeyboardExample {

    public static void main(String[] args) {
        String word = "αβγδεζηθικλμνξοπρσςτυφχψω";
        System.out.println("\n\n" + word + "\n");
        String answer = getInput("Type the word above: ");

        System.out.println("\nThis is what the computer took from the keyboard:");  
        printCharsAndCode(answer);

        System.out.println("\nThis is what it should look like:");  
        printCharsAndCode(word);
    }

    private static String getInput(String prompt) {
        System.out.print(prompt);
        System.out.flush();

        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF8"));
            return in.readLine();
        } 
        catch (Exception e) {
            return "Error: " + e.getMessage();
        } 
    }

    /* prints the character and its (unicode) code */
    public static void printCharsAndCode(String str) {
        char[] c = str.toCharArray();
        System.out.println(str);
        for (char d : c) {
            System.out.print("    " + d + " ");
            // extra space so that combining diacritics (NON_SPACING_MARK) display correctly
            if (Character.getType(d) == Character.NON_SPACING_MARK) System.out.print(" ");
        }
        System.out.println();
        for (char d : c) {
            System.out.printf("%1$#05x ", (int) d);
        }
        System.out.println();
    }
}

Here's the output:

αβγδεζηθικλμνξοπρσςτυφχψω

Type the word above: αβγδεζηθικλμνξοπρσςτυφχψω

This is what the computer took from the keyboard:
αβγδεζηθικλμνξοπ�σςτυφχψω
    Î     ±     Î     ²     Î     ³     Î     ´     Î     µ     Î     ¶     Î     ·     Î     ¸     Î     ¹     Î     º     Î     »     Î     ¼     Î     ½     Î     ¾     Î     ¿     Ï     €     Ï     �     Ï     ƒ     Ï     ‚     Ï     „     Ï     …     Ï     †     Ï     ‡     Ï     ˆ     Ï     ‰ 
0x0ce 0x0b1 0x0ce 0x0b2 0x0ce 0x0b3 0x0ce 0x0b4 0x0ce 0x0b5 0x0ce 0x0b6 0x0ce 0x0b7 0x0ce 0x0b8 0x0ce 0x0b9 0x0ce 0x0ba 0x0ce 0x0bb 0x0ce 0x0bc 0x0ce 0x0bd 0x0ce 0x0be 0x0ce 0x0bf 0x0cf 0x20ac 0x0cf 0xfffd 0x0cf 0x192 0x0cf 0x201a 0x0cf 0x201e 0x0cf 0x2026 0x0cf 0x2020 0x0cf 0x2021 0x0cf 0x2c6 0x0cf 0x2030 

This is what it should look like:
αβγδεζηθικλμνξοπρσςτυφχψω
    α     β     γ     δ     ε     ζ     η     θ     ι     κ     λ     μ     ν     ξ     ο     π     ρ     σ     ς     τ     υ     φ     χ     ψ     ω 
0x3b1 0x3b2 0x3b3 0x3b4 0x3b5 0x3b6 0x3b7 0x3b8 0x3b9 0x3ba 0x3bb 0x3bc 0x3bd 0x3be 0x3bf 0x3c0 0x3c1 0x3c3 0x3c2 0x3c4 0x3c5 0x3c6 0x3c7 0x3c8 0x3c9 


Can anyone advise me on how to fix the problem?

Upvotes: 3

Views: 4284

Answers (3)

JohnK

Reputation: 7337

I reported it as a bug, and it's just been confirmed as such:

"I confirm that this is a bug which will be fixed in the next release (Kepler)."

I appreciate everyone's input here.

Upvotes: 0

bmargulies

Reputation: 100050

Look at the 'Common' tab of the Eclipse Run/Debug configuration for the encoding. You can type in the correct code page or ISO code.

Upvotes: 0

QuantumMechanic

Reputation: 13946

Your code assumes that the bytes coming in via System.in have been encoded using UTF-8. Unless you've set your platform's default encoding to UTF-8, that is unlikely to be the case.

What happens if instead of UTF-8 you specify the encoding that matches your platform's default encoding?
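One way to try that is sketched below. It is only a minimal sketch: the class name DefaultCharsetInput and the readLine helper are hypothetical names made up for illustration, and it assumes that Charset.defaultCharset() really reflects the encoding your console uses (on JDK 18 and later the default is UTF-8 unless configured otherwise, so that assumption may not hold there).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class DefaultCharsetInput {

    // Hypothetical helper: reads one line from the stream, decoding its bytes
    // with the given charset. Returns null at end of input.
    public static String readLine(InputStream in, Charset charset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the console encodes keystrokes with the platform default.
        Charset platform = Charset.defaultCharset();
        System.out.println("Decoding System.in as " + platform);
        System.out.print("Type the word above: ");
        System.out.flush();

        String answer = readLine(System.in, platform);
        if (answer == null) {
            System.out.println("No input received.");
            return;
        }
        // Print code units rather than glyphs, so the result does not depend
        // on how the console renders output.
        for (char ch : answer.toCharArray()) {
            System.out.printf("%#05x ", (int) ch);
        }
        System.out.println();
    }
}
```

If the printed codes fall in the 0x3b1 to 0x3c9 range, the input was decoded as Greek letters; if not, the console is probably using a different encoding from the one reported.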

For example, my Linux machine does have its default encoding set to UTF-8 and when I run your program I get the "right" answer. However, I did have to change the definition of word to be:

String word = "\u03b1\u03b2\u03b3\u03b4\u03b5\u03b6\u03b7\u03b8\u03b9\u03ba\u03bb\u03bc\u03bd\u03be\u03bf\u03c0\u03c1\u03c3\u03c2\u03c4\u03c5\u03c6\u03c7\u03c8\u03c9";

because when I try to cut-and-paste the Greek letters into my editor, my editor does not understand them. Entering them as unicode escape sequences gives exactly the same string as if I had an editor that understood Greek letters typed into it.

So when I run your program with that change I get:

αβγδεζηθικλμνξοπρσςτυφχψω

Type the word above: αβγδεζηθικλμνξοπρσςτυφχψω

This is what the computer took from the keyboard:
αβγδεζηθικλμνξοπρσςτυφχψω
    α     β     γ     δ     ε     ζ     η     θ     ι     κ     λ     μ     ν     ξ     ο     π     ρ     σ     ς     τ     υ     φ     χ     ψ     ω 
0x3b1 0x3b2 0x3b3 0x3b4 0x3b5 0x3b6 0x3b7 0x3b8 0x3b9 0x3ba 0x3bb 0x3bc 0x3bd 0x3be 0x3bf 0x3c0 0x3c1 0x3c3 0x3c2 0x3c4 0x3c5 0x3c6 0x3c7 0x3c8 0x3c9 

This is what it should look like:
αβγδεζηθικλμνξοπρσςτυφχψω
    α     β     γ     δ     ε     ζ     η     θ     ι     κ     λ     μ     ν     ξ     ο     π     ρ     σ     ς     τ     υ     φ     χ     ψ     ω 
0x3b1 0x3b2 0x3b3 0x3b4 0x3b5 0x3b6 0x3b7 0x3b8 0x3b9 0x3ba 0x3bb 0x3bc 0x3bd 0x3be 0x3bf 0x3c0 0x3c1 0x3c3 0x3c2 0x3c4 0x3c5 0x3c6 0x3c7 0x3c8 0x3c9 

The reason why it worked for me is that my computer is set to use UTF-8. So when I type into a terminal, that terminal program and/or operating system will transform those characters to bytes using UTF-8, and when Java reads those bytes using UTF-8, all is great.

But if my computer were set to ISO-8859-1, then typing at the terminal would have generated bytes that make no sense in UTF-8, and "garbage" would have been read from the keyboard by the program. If the program were changed to use ISO-8859-1 as well, then it might have worked. (I say "might" because I don't know if ISO-8859-1 can validly encode Greek letters into bytes.) So for your program to work you need two things to be true:

  1. The encoding you use when wrapping the Reader around System.in must be the same encoding that your computer uses to transform characters to bytes when you type at the terminal.
  2. Whatever encoding your computer is using, it needs to be able to encode Greek letters to byte sequences that are valid in that encoding.
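Both conditions can be checked without a keyboard. The sketch below is illustrative only: the class EncodingMismatchDemo and its methods are hypothetical names, and the "terminal" is simulated by encoding a string to bytes in memory. It also settles the "might" above: ISO-8859-1 has no Greek letters, so it fails the second condition.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encodes the text as the (simulated) terminal would, then decodes the
    // resulting bytes as the program would.
    public static String roundTrip(String text, Charset terminal, Charset program) {
        byte[] bytes = text.getBytes(terminal);
        return new String(bytes, program);
    }

    // Condition 2: can this charset represent the text at all?
    public static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Formats each UTF-16 code unit as hex so the result is console-independent.
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("0x%x ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Condition 1 met: both sides use UTF-8, so the text survives.
        System.out.println(hex(roundTrip(greek, utf8, utf8)));   // 0x3b1 0x3b2 0x3b3

        // Condition 1 violated: UTF-8 bytes decoded as ISO-8859-1.
        System.out.println(hex(roundTrip(greek, utf8, latin1))); // 0xce 0xb1 0xce 0xb2 0xce 0xb3

        // Condition 2: ISO-8859-1 cannot encode Greek letters, UTF-8 can.
        System.out.println(canEncode(greek, latin1)); // false
        System.out.println(canEncode(greek, utf8));   // true
    }
}
```

The mismatched case produces the same 0xce 0xb1 pattern that appears in the question's output, which suggests the bytes were decoded with a single-byte encoding somewhere between the keyboard and the program.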

Upvotes: 5
