Paul

Reputation: 982

Incorrect printing of non-English characters with Java

I thought this was only an issue with Python 2, but I have now run into a similar issue with Java (Windows 10, JDK 8).

My searches have led to little resolution so far.

I read this value from the stdin input stream: Viļāni. When I print it to the console I get this: Vi????ni.

Relevant code snippets are as follows:

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));

    ArrayList<String> corpus = new ArrayList<String>();
    String inputString = null;
    while ((inputString = in.readLine()) != null) {
        corpus.add(inputString);
    }
    String[] allCorpus = new String[corpus.size()];
    allCorpus = corpus.toArray(allCorpus);
    for (String line : allCorpus) {
        System.out.println(line);
    }

Further expansion on my problem as follows:

I read a file containing the following 2 lines:

を
Sōten_Kōro

When I read this from disk and output to a second file I get the following output:

を
S�ten_K�ro

When I read the file from stdin using cat testinput.txt | java UTF8Tester I get the following output:

???
S??ten_K??ro

Both are obviously wrong. I need to be able to print the correct characters to console and file. My sample code is as follows:

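import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

public class UTF8Tester {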

    public static void main(String args[]) throws Exception {
        BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String[] stdinData = readLines(stdinReader);
        printToFile(stdinData, "stdin_out.txt");

        BufferedReader fileReader = new BufferedReader(new FileReader("testinput.txt"));
        String[] fileData = readLines(fileReader);
        printToFile(fileData, "file_out.txt");

    }

    private static void printToFile(String[] data, String fileName)
            throws FileNotFoundException, UnsupportedEncodingException {
        PrintWriter writer = new PrintWriter(fileName, "UTF-8");
        for (String line : data) {
            writer.println(line);
        }
        writer.close();
    }

    private static String[] readLines(BufferedReader reader) throws IOException {
        ArrayList<String> corpus = new ArrayList<String>();
        String inputString = null;

        while ((inputString = reader.readLine()) != null) {
            corpus.add(inputString);
        }
        String[] allCorpus = new String[corpus.size()];
        return corpus.toArray(allCorpus);
    }

}

Really stuck here and help would really be appreciated! Thanks in advance. Paul

Upvotes: 1

Views: 725

Answers (1)

Joop Eggen

Reputation: 109547

  • System.in/out will use the default Windows character set.
  • Java String will use Unicode internally.
  • FileReader/FileWriter are old utility classes that use the default character set, hence they are for non-portable local files only.

The error you saw was a special character stored as a two-byte UTF-8 sequence, with each of those bytes then interpreted in the default single-byte encoding. Neither byte value maps to a character there, hence two ? substitutions per character.
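The substitution can be reproduced without a console. The following is a minimal sketch, not the asker's code: the class name MisdecodeDemo and its two helper methods are made up for illustration, and US-ASCII merely stands in for "a single-byte charset that has no character for these byte values" (the actual Windows default charset will differ).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {

    // Hypothetical helper: encode as UTF-8, then decode the bytes with the wrong charset.
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    // Hypothetical helper: encode and decode with the same charset.
    // Characters the charset cannot represent come back as '?'.
    public static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "Soten_Koro" with macrons, written with escapes so the source stays ASCII.
        String word = "S\u014Dten_K\u014Dro";

        // Each macron letter takes two bytes in UTF-8.
        System.out.println(word.length());                                // 10
        System.out.println(word.getBytes(StandardCharsets.UTF_8).length); // 12

        // Each of those four bytes is undecodable as US-ASCII and becomes U+FFFD.
        String wrong = misdecode(word, StandardCharsets.US_ASCII);
        System.out.println(wrong.length());                               // 12

        // Encoding U+FFFD into a charset that lacks it substitutes '?'.
        System.out.println(roundTrip(wrong, StandardCharsets.US_ASCII));  // S??ten_K??ro
    }
}
```

The final line matches the S??ten_K??ro from the question: two question marks for every character that occupied two bytes.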

  • The character must be enterable on System.in in the default charset.
  • The String is then converted from the default charset.
  • Writing it to a file in UTF-8 requires specifying UTF-8 explicitly.

Hence:

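    // Decode System.in with the platform default charset (no explicit charset argument).
    BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in));
    String[] stdinData = readLines(stdinReader);
    printToFile(stdinData, "stdin_out.txt");

    // A UTF-8 file: Files.readAllLines(Path) defaults to UTF-8, not the platform charset.
    Path utf8Path = Paths.get("testinput-utf8.txt");
    List<String> utf8Lines = Files.readAllLines(utf8Path);

    // A Windows Latin-1 file: the charset is passed as a Charset, not a String.
    Path latin1Path = Paths.get("testinput-winlatin1.txt");
    List<String> latin1Lines = Files.readAllLines(latin1Path, Charset.forName("windows-1252"));

    // Files.write takes the target path first, then the lines, then the charset.
    Files.write(Paths.get("file_out.txt"), utf8Lines, StandardCharsets.UTF_8);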
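As a self-contained check of the file half of this, here is a minimal sketch that writes the two sample lines as UTF-8 and reads them back with an explicit charset. The class name Utf8FileRoundTrip, the helper roundTripInTempFile and the temporary file prefix are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class Utf8FileRoundTrip {

    // Hypothetical helper: write the lines as UTF-8 to a temporary file,
    // read them back as UTF-8, and delete the file again.
    public static List<String> roundTripInTempFile(List<String> lines) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.write(file, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The two lines from the question, escaped so the source stays ASCII.
        List<String> lines = Arrays.asList("\u3092", "S\u014Dten_K\u014Dro");
        List<String> readBack = roundTripInTempFile(lines);
        System.out.println(readBack.equals(lines)); // true
    }
}
```

Because both the write and the read name the charset, the result does not depend on the platform default, which is what FileReader in the question relied on.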

To check whether your current computer system handles Japanese:

System.out.println("Hiragana letter Wo '\u3092'."); // Either を or ?.

If you see ?, the conversion to the default system encoding could not represent the character. を is U+3092, written in ASCII as the Unicode escape \u3092.
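If the default encoding cannot represent the character, one option is to wrap System.out in a PrintStream with an explicit encoding. This is only a sketch: the class name Utf8ConsoleDemo and the helper encodeViaPrintStream are invented here, and whether the console then displays the character correctly still depends on the console's own code page and font, which Java does not control.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8ConsoleDemo {

    // Hypothetical helper: print text through a UTF-8 PrintStream into a buffer
    // and return the bytes that were actually produced.
    public static byte[] encodeViaPrintStream(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, "UTF-8")) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode output as UTF-8 instead of the platform default charset.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("Hiragana letter Wo '\u3092'.");

        // The same character as raw bytes: prints E3 82 92
        for (byte b : encodeViaPrintStream("\u3092")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

The byte dump shows what Java emits regardless of what the console renders, which helps separate an encoding problem from a display problem.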

To create a UTF-8 text file under Windows:

Files.write(Paths.get("out-utf8.txt"),
    "\uFEFFHiragana letter Wo '\u3092'.".getBytes(StandardCharsets.UTF_8));

Here I use an ugly (generally unneeded) BOM character \uFEFF (a zero-width no-break space) that lets Windows Notepad recognize the text as UTF-8.
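One consequence worth knowing: when such a file is read back in Java, the BOM is not removed by the UTF-8 decoder and shows up as the first character of the text. A minimal sketch of dealing with that follows; the class name BomDemo and the helper stripBom are hypothetical, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {

    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drop a single leading BOM character, if present.
    public static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        byte[] bytes = (BOM + "abc").getBytes(StandardCharsets.UTF_8);

        // The BOM occupies three bytes in UTF-8: prints EF BB BF
        System.out.printf("%02X %02X %02X%n", bytes[0], bytes[1], bytes[2]);

        // Decoding keeps the BOM as an ordinary character.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.length());  // 4
        System.out.println(stripBom(decoded)); // abc
    }
}
```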

Upvotes: 2
