Somnath Musib

Reputation: 3714

Not able to see Arabic characters after converting the file from ISO 8859-6 to UTF-8

In my application, I am reading a file that contains some Arabic characters (encoded as ISO 8859-6), converting it to UTF-8, and writing the result to a new file using a BufferedWriter. However, in the newly generated file I cannot see the Arabic characters; a few question marks appear instead.

Snippet from my original File

Sample Data//لمند
Another line,
One more line/لمند

Snippet from the generated file

 Sample Data//????
 Another line,
 One more line/????

I am using the method below for the conversion:

private String convertCharSet(String data, String sourceCharacterCode, String destinationCharacterCode) throws UnsupportedEncodingException
{
        Charset charsetSource = Charset.forName(sourceCharacterCode);
        Charset charsetDestination = Charset.forName(destinationCharacterCode);
        ByteBuffer inputByteBuffer = ByteBuffer.wrap(data.getBytes(sourceCharacterCode));
        CharBuffer charBuffer = charsetSource.decode(inputByteBuffer);
        ByteBuffer outputByteBuffer = charsetDestination.encode(charBuffer);
        return new String(outputByteBuffer.array(), destinationCharacterCode);
}

I am using the method below for writing into the file:

public static void writeToFile(String filePath, String data) throws IOException
{
    BufferedWriter out = null;
    try
    {
        out = new BufferedWriter(new FileWriter(new File(filePath)));
        out.write(data);
        out.flush();
    }
    finally
    {
        out.close();
    }
}

Observations

  1. In Notepad++, I opened the file as ISO 8859-6 and could see the Arabic characters. I then converted it using the Convert to UTF-8 option, and the Arabic characters were still visible after conversion.

  2. I have debugged my program in Eclipse; both before the conversion and after the conversion to UTF-8, I could see the Arabic characters. But once the contents are written to the file, I get those ? marks instead of the Arabic characters.

Note

Any help is greatly appreciated.

Upvotes: 0

Views: 3276

Answers (2)

Joop Eggen

Reputation: 109547

In Java (as opposed to some other languages), text, i.e. String/char/Reader/Writer, is Unicode and can combine all scripts.

So the conversion must take place not between Strings, but between String and binary data, byte[]/InputStream/OutputStream.

Path sourcePath = Paths.get("C:/data/arab.txt");
Path targetPath = Paths.get("C:/data/arab-utf8.txt");
byte[] sourceData = Files.readAllBytes(sourcePath);

String s = new String(sourceData, "ISO-8859-6");

byte[] targetData = s.getBytes(StandardCharsets.UTF_8);
Files.write(targetPath, targetData);

As you can see, it is conceptually easy in Java, once one knows.

FileWriter/FileReader are old utility classes that use the platform default encoding. They are not portable; use them only for local files.


In Java 1.6 (without exception handling):

File sourceFile = ...
File targetFile = ...
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(sourceFile), "ISO-8859-6"));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(targetFile), "UTF-8"));
for (;;) {
    String line = in.readLine();
    if (line == null) {
        break;
    }
    out.write(line);
    out.write("\r\n"); // Windows CR+LF.
}
out.close();
in.close();
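Since Java 7, the same conversion can be written more compactly with try-with-resources and java.nio.file. Here is a sketch along those lines; the class and method names are illustrative, not from the question:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Recode {
    // Reads the source file as ISO-8859-6 and writes it back out as UTF-8.
    // try-with-resources closes both streams even if an exception is thrown.
    public static void recode(Path source, Path target) throws IOException {
        Charset iso = Charset.forName("ISO-8859-6");
        try (BufferedReader in = Files.newBufferedReader(source, iso);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine(); // platform line separator
            }
        }
    }
}
```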

Upvotes: 5

Michael-O

Reputation: 18405

Your writeToFile method is broken. You are opening an implicit Writer without specifying the encoding, so the platform default encoding is used and your files end up broken. Use a Writer that accepts an encoding.
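A corrected writeToFile along those lines might look like this (a sketch; the signature mirrors the question's method, with an OutputStreamWriter supplying the explicit UTF-8 charset):

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class FileUtil {
    // Writes data with an explicit UTF-8 encoding instead of the platform default.
    public static void writeToFile(String filePath, String data) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(filePath), StandardCharsets.UTF_8))) {
            out.write(data);
        }
    }
}
```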

Upvotes: 0
