purnendu

Reputation: 51

How to read Unicode characters in Java

I am trying to read Unicode characters from a text file saved as UTF-8, using Java. My text file is as follows:

अ, अदेबानि ,अन, अनसुला, अनसुलि, अनफावरि, अनजालु, अनद्ला, अमा, अर, अरगा, अरगे, अरन, अराय, अलखद, असे, अहा, अहिंसा, अग्रं, अन्थाइ, अफ्रि, बियन, खियन, फियन, बन, गन, थन, हर, हम, जम, गल, गथ, दरसे, दरनै, थनै, थथाम, सथाम, खफ, गल, गथ, मिख, जथ, जाथ, थाथ, दद, देख, न, नेथ, बर, बुंथ, बिथ, बिख, बेल, मम, आ, आइ, आउ, आगदा, आगसिर

I have tried the following code:

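import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class UcharRead
{
    public static void main(String[] args)
    {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader bufReader = new BufferedReader(
                new InputStreamReader(new FileInputStream("research_words.txt"), "UTF-8")))
        {
            String str;
            while ((str = bufReader.readLine()) != null)
            {
                System.out.println(str);
            }
        }
        catch (IOException e)
        {
            // report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}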

The output I get is ????????????????????????. Can anyone help me?

Upvotes: 5

Views: 10620

Answers (3)

Juned Ahsan

Reputation: 68715

If you are reading the text properly using UTF-8 encoding, then make sure that your console also supports UTF-8. If you are using Eclipse, you can enable UTF-8 encoding for your console via:

Run Configuration -> Common -> Encoding -> select UTF-8

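Independently of the IDE setting, the program can also be made to encode its console output as UTF-8 explicitly. The following is only a minimal sketch: the class name Utf8ConsoleDemo and the helper utf8Writer are made up for illustration, and it assumes the terminal behind System.out actually interprets and can render UTF-8 (if it cannot, you will still see garbage or question marks).

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class Utf8ConsoleDemo {
    // Hypothetical helper: wraps any byte stream in a writer that encodes text as UTF-8.
    static PrintWriter utf8Writer(OutputStream target) {
        // The second argument enables auto-flush on println.
        return new PrintWriter(new OutputStreamWriter(target, StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        PrintWriter out = utf8Writer(System.out);
        // One word from the question's list (U+0905 U+092E U+093E), written with
        // Unicode escapes so the example does not depend on the source file encoding.
        out.println("\u0905\u092E\u093E");
    }
}
```

In the question's program, the same writer could be used in place of System.out.println(str) inside the read loop.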

Upvotes: 6

Thilo

Reputation: 262834

You are (most likely) reading the text correctly, but when you write it out, you also need to enable UTF-8. Otherwise every character that cannot be printed in your default encoding will be turned into question marks.

Try writing it to a file instead of System.out, and specify the proper encoding:

Writer w = new OutputStreamWriter(
   new FileOutputStream("x.txt"), "UTF-8");
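Put together with the reading code from the question, the idea might look like the sketch below. The class name Utf8FileCopy and the helper copyAsUtf8 are hypothetical; the file names are the ones used in the question and in the snippet above. If x.txt looks right when opened in a UTF-8-aware editor, the reading side is fine and only the console is at fault.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

class Utf8FileCopy {
    // Hypothetical helper: decodes `source` as UTF-8 and re-encodes every line
    // into `target` as UTF-8, so no platform default encoding is involved.
    static void copyAsUtf8(String source, String target) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(source), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String source = "research_words.txt"; // input file from the question
        String target = "x.txt";              // output file from the snippet above
        if (!new File(source).isFile()) {
            System.err.println("Input file not found: " + source);
            return;
        }
        copyAsUtf8(source, target);
    }
}
```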

Upvotes: 9

Jon Skeet

Reputation: 1503489

You're reading it correctly - the problem is almost certainly just that your console can't handle the text. The simplest way to verify this is to print out each char within the string. For example:

public static void dumpString(String text) {
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        System.out.printf("%c - %04x\n", c, (int) c);
    }
}

You can then verify that each character is correct using the Unicode code charts.
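As a complement to looking values up by hand, the JDK can report each character's Unicode name and block, and that output is plain ASCII, so it survives even a console that cannot show the characters themselves. This is just a sketch: CharInspector, describe, and isDevanagari are illustrative names, and the check assumes the file is expected to contain Devanagari letters, as the sample in the question does.

```java
class CharInspector {
    // Hypothetical helper: formats one UTF-16 code unit as "U+XXXX NAME [BLOCK]".
    static String describe(char c) {
        return String.format("U+%04X %s [%s]",
                (int) c, Character.getName(c), Character.UnicodeBlock.of(c));
    }

    // True if the character belongs to the Devanagari block (U+0900 to U+097F).
    static boolean isDevanagari(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI;
    }

    public static void main(String[] args) {
        // One word from the question's list, written with Unicode escapes.
        String text = "\u0905\u092E\u093E";
        for (int i = 0; i < text.length(); i++) {
            System.out.println(describe(text.charAt(i)));
        }
    }
}
```

If the file were decoded with the wrong charset, the names would typically come out as Latin letters or REPLACEMENT CHARACTER rather than Devanagari ones.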

Once you've verified that you're reading the file correctly, you can then work on the output side of things - but it's important to try to focus on one side of it at a time. Trying to diagnose potential failures in both input and output encodings at the same time is very hard.

Upvotes: 5
