Mohammed Deifallah
Mohammed Deifallah

Reputation: 1330

Java IO with UTF characters

I have a weird problem with files.

I intend to modify the timing of an .srt file, but writing the new file seems to be a weird task.

Here's a sample code I wrote:

import java.io.*;
import java.nio.charset.Charset;

public class ReaderWriter {
    public static void main(String[] args) throws IOException {
        InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
        Reader reader = new InputStreamReader(inputStream,
                Charset.forName("UTF-8"));
        OutputStream outputStream = new FileOutputStream("output.srt");
        Writer writer = new OutputStreamWriter(outputStream,
                Charset.forName("UTF-8"));

        int data = reader.read();
        while (data != -1) {
            char theChar = (char) data;
            writer.write(theChar);
            data = reader.read();
        }
        reader.close();
        writer.close();
    }
}

This is an image from the original file: enter image description here

However, the resulted file seems like: enter image description here

I searched a lot for a solution but in vain. Any help, please.

Upvotes: 0

Views: 97

Answers (1)

skomisa
skomisa

Reputation: 17343

First a few points:

  • There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
  • I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
  • Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
  • Alternatively, if the input file truly is UTF-8 encoded, specify it as UTF-16 in the code. For example, here is a valid UTF-8 input file with some Arabic text where the code (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:

    يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.

    And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:

    ���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����꫘뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�

Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, just create another input file and save it with UTF-8 character encoding, then run your code. If it works then you know that the problem was the that the encoding of input file was not UTF-8.

Upvotes: 2

Related Questions