Reputation: 3714
In my application, I am reading a file containing some Arabic characters (encoded as ISO 8859-6), converting it to UTF-8, and writing it back to a new file using BufferedWriter. However, in the newly generated file I cannot see the Arabic characters; a few question marks appear instead.
Snippet from my original file:
Sample Data//لمند
Another line,
One more line/لمند
Snippet from the generated file:
Sample Data//????
Another line,
One more line/????
I am using the method below for the conversion:
private String convertCharSet(String data, String sourceCharacterCode, String destinationCharacterCode) throws UnsupportedEncodingException
{
    Charset charsetSource = Charset.forName(sourceCharacterCode);
    Charset charsetDestination = Charset.forName(destinationCharacterCode);
    ByteBuffer inputByteBuffer = ByteBuffer.wrap(data.getBytes(sourceCharacterCode));
    CharBuffer charBuffer = charsetSource.decode(inputByteBuffer);
    ByteBuffer outputByteBuffer = charsetDestination.encode(charBuffer);
    return new String(outputByteBuffer.array(), destinationCharacterCode);
}
I am using the method below for writing to the file:
public static void writeToFile(String filePath, String data) throws IOException
{
    BufferedWriter out = null;
    try
    {
        out = new BufferedWriter(new FileWriter(new File(filePath)));
        out.write(data);
        out.flush();
    }
    finally
    {
        out.close();
    }
}
Observations
In Notepad++, I opened the file as ISO 8859-6 and could see the Arabic characters. I converted it to UTF-8 using the Convert to UTF-8 option, and after the conversion I could still see the Arabic characters.
I debugged my program in Eclipse; there, I could see the Arabic characters both before the conversion and after the conversion to UTF-8. But once the contents are written to the file, I get those ? marks instead of the Arabic characters.
Note
I am passing -Dfile.encoding=ISO-8859-6 as a VM argument. Any help is greatly appreciated.
Upvotes: 0
Views: 3276
Reputation: 109547
In Java (as opposed to some other languages), text (String/char/Reader/Writer) is Unicode and can combine all scripts. So the conversion must take place not between Strings, but between a String and binary data (byte[]/InputStream/OutputStream).
Path sourcePath = Paths.get("C:/data/arab.txt");
Path targetPath = Paths.get("C:/data/arab-utf8.txt"); // example target path
byte[] sourceData = Files.readAllBytes(sourcePath);
String s = new String(sourceData, "ISO-8859-6");        // decode the source bytes
byte[] targetData = s.getBytes(StandardCharsets.UTF_8); // encode as UTF-8
Files.write(targetPath, targetData); // creates or truncates the target by default
As you can see, it is conceptually easy in Java - once you know how.
FileWriter/FileReader are old utility classes that use the default platform encoding. That is not portable; they are suitable only for local files.
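For Java 7 and later, one way to avoid the platform-default trap is java.nio.file.Files.newBufferedWriter, which requires an explicit charset. A minimal sketch, assuming data holds the already-decoded Unicode String and the target path is just an example:
// Requires java.nio.file.{Files, Path, Paths} and java.nio.charset.StandardCharsets.
Path target = Paths.get("C:/data/arab-utf8.txt"); // example path
try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
    out.write(data); // data: the decoded Unicode String
}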
In Java 1.6 (without exception handling):
File sourceFile = ...
File targetFile = ...
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(sourceFile), "ISO-8859-6"));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(targetFile), "UTF-8"));
for (;;) {
    String line = in.readLine();
    if (line == null) {
        break;
    }
    out.write(line);
    out.write("\r\n"); // Windows CR+LF.
}
out.close();
in.close();
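On Java 7 and later, the same copy can be written with try-with-resources, so both streams are closed even if an exception occurs. A sketch under the same sourceFile/targetFile assumptions:
try (BufferedReader in = new BufferedReader(new InputStreamReader(
         new FileInputStream(sourceFile), "ISO-8859-6"));
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
         new FileOutputStream(targetFile), "UTF-8"))) {
    String line;
    while ((line = in.readLine()) != null) {
        out.write(line);
        out.write("\r\n"); // readLine() strips terminators; re-add Windows CR+LF
    }
}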
Upvotes: 5
Reputation: 18405
Your writeToFile method is broken. You are opening an implicit Writer without specifying the encoding, so the standard platform encoding will be used and your files will be broken. Use a Writer which accepts an encoding.
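For example, a minimal corrected sketch of the writeToFile method from the question, assuming UTF-8 is the desired target encoding:
public static void writeToFile(String filePath, String data) throws IOException
{
    // OutputStreamWriter takes the encoding explicitly, unlike FileWriter,
    // which silently uses the platform default.
    Writer out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(filePath), "UTF-8"));
    try
    {
        out.write(data);
    }
    finally
    {
        out.close();
    }
}
OutputStreamWriter also accepts a Charset, so passing StandardCharsets.UTF_8 instead of "UTF-8" avoids the checked UnsupportedEncodingException.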
Upvotes: 0