user1841718

Reputation: 187

Reading file with charset encoding

I am trying to write an Arabic word to a file from Java using a BufferedOutputStream. After writing, Windows Notepad reports the file's encoding as UTF-8, so it seems obvious that the default charset for writing files in Java is UTF-8. The strange part is that when I read the file back with a BufferedInputStream, it does not seem to be read as UTF-8: the result is strange symbols.

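    import java.io.*;

    class writeFile extends BufferedOutputStream {
        public writeFile(OutputStream out) {
            super(out);
        }

        public static void main(String[] args) throws IOException {
            // try-with-resources flushes and closes the buffered stream
            try (writeFile out = new writeFile(new FileOutputStream(new File("path_String")))) {
                // getBytes() with no argument encodes with the platform default charset
                out.write("مكتبة".getBytes());
            }
        }
    }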

The word is written to the file correctly, but when I read it back:

    import java.io.*;

    class readFile extends BufferedInputStream {
        public readFile(InputStream in) {
            super(in);
        }

        public static void main(String[] args) throws IOException {
            try (readFile in = new readFile(new FileInputStream(new File("path_String")))) {
                int c;
                // read() returns a single byte (0-255) per call, or -1 at end of stream
                while ((c = in.read()) != -1)
                    System.out.print((char) c);
            }
        }
    }

The output does not match what was written to the file: ÙÙتبة

So does this mean that Java uses UTF-8 when writing but a different encoding when reading?

Upvotes: 0

Views: 1758

Answers (1)

Mad Physicist

Reputation: 114488

The issue is not that it is not reading with UTF-8; it's that you are trashing the encoding in your read operation. FileInputStream.read() is very clearly documented to read one byte at a time. Casting each byte to a char is not going to work if you have multi-byte sequences in your file (which you almost certainly do, since it is in Arabic).
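To make this concrete, here is a minimal sketch (the class and method names are made up for illustration) that encodes the word from the question as UTF-8 and compares casting each byte to a char against decoding the whole byte sequence. The word is spelled with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class ByteVsCharDemo {
    // Mimics the loop in the question: every byte becomes its own char.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // InputStream.read() yields 0..255, so mask off the sign first
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the byte sequence as UTF-8, combining multi-byte sequences.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Arabic word from the question, written as escapes
        String word = "\u0645\u0643\u062A\u0628\u0629";
        byte[] bytes = word.getBytes(StandardCharsets.UTF_8);

        // Each of these letters takes two bytes in UTF-8
        System.out.println(word.length() + " chars, " + bytes.length + " bytes"); // 5 chars, 10 bytes
        System.out.println(castEachByte(bytes).equals(word)); // false
        System.out.println(decodeUtf8(bytes).equals(word));   // true
    }
}
```

The per-byte cast produces ten unrelated characters instead of five Arabic letters, which is the kind of output shown in the question.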

As you figured out, the easiest solution is to use InputStreamReader, which reads the bytes from an underlying FileInputStream (or other stream), and correctly decodes the character sequences. The default encoding here is of course the same as for the writer:

An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
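As a rough sketch of that approach, the helper below wraps any InputStream in an InputStreamReader. It names UTF-8 explicitly rather than accepting the platform default, which is my own choice here and assumes the file really was written as UTF-8. The class name ReaderDemo and the method readText are hypothetical, and a ByteArrayInputStream stands in for the file so the example runs without touching the disk.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Reads the stream to the end, decoding its bytes as UTF-8 text.
    static String readText(InputStream source) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8))) {
            int c;
            // Reader.read() returns one decoded char per call, not one byte
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Arabic word from the question, written as escapes
        String word = "\u0645\u0643\u062A\u0628\u0629";
        // Stand-in for the file; for a real file pass new FileInputStream("path_String")
        InputStream source =
                new ByteArrayInputStream(word.getBytes(StandardCharsets.UTF_8));
        System.out.println(readText(source).equals(word)); // true
    }
}
```

The read loop has the same shape as the one in the question; the only change is that it runs over a Reader, so each call yields a whole character.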

You can do a similar thing by reading the entire file into a byte buffer and then decoding it in one go using something like new String(byte[]). If you read the entire file, the result should be identical, because the decoder then has enough information to correctly parse out all the multi-byte characters.
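The sketch below illustrates why the whole buffer matters. It decodes the same UTF-8 bytes once in full and once in two pieces split in the middle of a two-byte character. The names WholeBufferDemo, decodeWhole and decodeInTwoPieces are hypothetical, the temporary file stands in for the file from the question, and UTF-8 is passed explicitly on the assumption that this is how the file was written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WholeBufferDemo {
    // Decodes the complete buffer in one call.
    static String decodeWhole(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes two slices independently; a character that straddles the
    // split point cannot be reassembled and is replaced with U+FFFD.
    static String decodeInTwoPieces(byte[] bytes, int split) {
        String head = new String(bytes, 0, split, StandardCharsets.UTF_8);
        String tail = new String(bytes, split, bytes.length - split, StandardCharsets.UTF_8);
        return head + tail;
    }

    public static void main(String[] args) throws IOException {
        // The Arabic word from the question, written as escapes
        String word = "\u0645\u0643\u062A\u0628\u0629";

        // Temporary file standing in for the one written in the question
        Path file = Files.createTempFile("arabic", ".txt");
        Files.write(file, word.getBytes(StandardCharsets.UTF_8));

        byte[] bytes = Files.readAllBytes(file); // the entire file as one buffer
        System.out.println(decodeWhole(bytes).equals(word));          // true
        System.out.println(decodeInTwoPieces(bytes, 3).equals(word)); // false
        Files.delete(file);
    }
}
```

Splitting at byte 3 cuts the second letter in half, so piecewise decoding loses it, while decoding the full buffer recovers the original word.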

There is a reference on encoding and decoding that I found very useful in understanding the subject: http://kunststube.net/encoding/

Upvotes: 1
