Somnath Musib

Reputation: 3714

Not able to see Arabic characters after converting the file from ISO 8859-6 to UTF-8

In my application, I am reading a file that contains some Arabic characters (encoded as ISO 8859-6), converting it to UTF-8, and writing the result to a new file using a BufferedWriter. However, in the newly generated file I cannot see the Arabic characters; a few question marks appear instead.

Snippet from my original File

Sample Data//لمند
Another line,
One more line/لمند

Snippet from the generated file

 Sample Data//????
 Another line,
 One more line/????

I am using the method below for the conversion:

private String convertCharSet(String data, String sourceCharacterCode, String destinationCharacterCode) throws UnsupportedEncodingException
{
        Charset charsetSource = Charset.forName(sourceCharacterCode);
        Charset charsetDestination = Charset.forName(destinationCharacterCode);
        ByteBuffer inputByteBuffer = ByteBuffer.wrap(data.getBytes(sourceCharacterCode));
        CharBuffer charBuffer = charsetSource.decode(inputByteBuffer);
        ByteBuffer outputByteBuffer = charsetDestination.encode(charBuffer);
        return new String(outputByteBuffer.array(), destinationCharacterCode);
}

I am using the method below for writing into the file:

public static void writeToFile(String filePath, String data) throws IOException
{
    BufferedWriter out = null;
    try
    {
        out = new BufferedWriter(new FileWriter(new File(filePath)));
        out.write(data);
        out.flush();
    }
    finally
    {
        out.close();
    }
}

Observations

  1. In Notepad++, I opened the file as ISO 8859-6 and could see the Arabic characters. I then converted it using the Convert to UTF-8 option, and the Arabic characters were still visible after conversion.

  2. I have debugged my program in Eclipse; both before the conversion and after the conversion to UTF-8, I could see the Arabic characters. But once the contents are written to the file, I get those ? marks instead of the Arabic characters.

Note

Any help is greatly appreciated.

Upvotes: 0

Views: 3276

Answers (2)

Joop Eggen

Reputation: 109547

In Java (as opposed to some other languages), text, i.e. String/char/Reader/Writer, is Unicode and can combine all scripts.

So the conversion must take place not between Strings, but between String and binary data, byte[]/InputStream/OutputStream.

Path sourcePath = Paths.get("C:/data/arab.txt");
Path targetPath = Paths.get("C:/data/arab-utf8.txt");
byte[] sourceData = Files.readAllBytes(sourcePath);

String s = new String(sourceData, "ISO-8859-6");

byte[] targetData = s.getBytes(StandardCharsets.UTF_8);
Files.write(targetPath, targetData);

As you can see, it is conceptually easy in Java, once one knows.

FileWriter/FileReader are old utility classes that use the platform default encoding. They are not portable; use them only for local files.


In Java 1.6 (without exception handling):

File sourceFile = ...
File targetFile = ...
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(sourceFile), "ISO-8859-6"));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(targetFile), "UTF-8"));
for (;;) {
    String line = in.readLine();
    if (line == null) {
        break;
    }
    out.write(line);
    out.write("\r\n"); // Windows CR+LF.
}
out.close();
in.close();
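Since Java 7, the same conversion can be written more compactly with try-with-resources and java.nio.file. Here is a sketch along those lines; the class and method names are illustrative, not from the question:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Recode {
    // Reads the source file as ISO-8859-6 and writes it back out as UTF-8.
    // try-with-resources closes both streams even if an exception is thrown.
    public static void recode(Path source, Path target) throws IOException {
        Charset iso = Charset.forName("ISO-8859-6");
        try (BufferedReader in = Files.newBufferedReader(source, iso);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine(); // platform line separator
            }
        }
    }
}
```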

Upvotes: 5

Michael-O

Reputation: 18405

Your writeToFile method is broken. You are opening an implicit Writer without specifying the encoding, so the platform default encoding is used and your files end up broken. Use a Writer that accepts an encoding.
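A corrected writeToFile along those lines might look like this (a sketch; the signature mirrors the question's method, with an OutputStreamWriter supplying the explicit UTF-8 charset):

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class FileUtil {
    // Writes data with an explicit UTF-8 encoding instead of the platform default.
    public static void writeToFile(String filePath, String data) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(filePath), StandardCharsets.UTF_8))) {
            out.write(data);
        }
    }
}
```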

Upvotes: 0
