Lezan

Reputation: 707

Error reading UTF-8 file in Java

I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason the Unicode characters come out garbled.

This is the code I have:

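import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public static String readSentence(String resourceName) {
    InputStream refStream = ClassLoader
            .getSystemResourceAsStream(resourceName);
    if (refStream == null) {
        throw new RuntimeException("Resource not found: " + resourceName);
    }
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
            refStream, Charset.forName("UTF-8")))) {
        String sentence = br.readLine();
        // readLine() returns null when the resource is empty
        return sentence == null ? null : sentence.trim();
    } catch (IOException e) {
        // keep the original exception as the cause
        throw new RuntimeException("Cannot read sentence: " + resourceName, e);
    }
}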

Upvotes: 2

Views: 2991

Answers (3)

Stephen C

Reputation: 719596

The problem is probably in the way that the string is being output.

I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:

for (char ch : sentence.toCharArray()) {
    System.err.println("char '" + ch + "' is unicode codepoint " + ((int) ch));
}

and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is on the output side; if not, it is on the input side.
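If the text may contain characters above U+FFFF, iterating over char values shows the two surrogate halves rather than the real code point. Here is a minimal sketch of the same check using String.codePoints() (Java 8+). The class name CodePointDump and the method describe are made up for illustration, not part of any library:

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDump {
    // Describes each code point as "U+XXXX". Unlike iterating over char
    // values, this keeps surrogate pairs (characters above U+FFFF) together.
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on
        // the encoding of the source file itself.
        System.err.println(describe("caf\u00E9"));
    }
}
```

Compare the printed values against a Unicode chart for the characters you expect. A U+FFFD in the output usually means the decoder hit bytes it could not interpret.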

Upvotes: 2

Roman

Reputation: 66216

One of the most annoying reasons could be... your IDE settings.

If your IDE's default console encoding is something like Latin-1, you can struggle for a long time with different variations of the Java code, and nothing will help until you set the IDE's encoding options correctly.
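To see what the JVM assumes, and to rule out the Java side of the output path, something like the following may help. It is only a sketch: Utf8Console and utf8Stream are hypothetical names, and the console or IDE still has to be configured to interpret the bytes as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class Utf8Console {
    // Wraps a stream so that printed text is encoded as UTF-8,
    // independent of the JVM's default charset.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shows the platform default charset the JVM picked up.
        System.out.println("Default charset: " + Charset.defaultCharset());

        PrintStream out = utf8Stream(System.out);
        out.println("caf\u00E9 \u4F60\u597D");
    }
}
```

If the second line still looks wrong, the bytes leaving Java are UTF-8 and the display side (IDE console or terminal encoding) is the likely culprit.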

Upvotes: 1

Mot

Reputation: 29600

First, you could create the InputStreamReader as

new InputStreamReader(refStream, "UTF-8")

Also, you should verify that the resource really contains UTF-8 content.
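One way to do that check in code is to decode the raw bytes with a strict decoder. InputStreamReader silently replaces malformed bytes with U+FFFD, so it will not tell you that the file is in another encoding. A minimal sketch, where Utf8Validator and isValidUtf8 are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {
    // Returns true only if the bytes decode as UTF-8 without any
    // malformed sequence; REPORT makes the decoder throw instead of
    // substituting a replacement character.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Feed it the bytes of the resource, read without any Reader in between. Note that a file containing only ASCII passes this check too, so a true result means "consistent with UTF-8", not proof of the author's intent.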

Upvotes: 1
