Reputation: 823
I am using Apache POI to read a .docx file and, after some processing, write the data to a .csv file.
The .docx file is in French, but when I write the data to the .csv, some of the French characters are converted to special characters.
For example, Être un membre clé
is converted to ÃŠtre un membre clÃ©
The code below is used to write the file:
Path path = Paths.get(filePath);
BufferedWriter bw = Files.newBufferedWriter(path);
CSVWriter writer = new CSVWriter(bw);
writer.writeAll(data);
which uses UTF-8 by default.
While debugging, I checked the data just before writing it to the .csv and it is still correct, so it seems to get converted during writing. I have also set the default locale to Locale.FRENCH.
Have I missed something?
Upvotes: 0
Views: 1696
Reputation: 61870
I suspect it is Excel that reads the UTF-8 encoded CSV as ANSI (the system's legacy code page, for example Windows-1252). This happens when you simply open the CSV in Excel without using the Text Import Wizard: Excel then expects ANSI unless there is a BOM (byte order mark) at the beginning of the file. If you open the CSV in a text editor that supports Unicode, everything will be correct.
Example:
import java.io.BufferedWriter;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;
import java.util.Locale;
import java.util.List;
import java.util.ArrayList;
import com.opencsv.CSVWriter;
class DocxToCSV {
    public static void main(String[] args) throws Exception {
        Locale.setDefault(Locale.FRENCH);
        List<String[]> data = new ArrayList<String[]>();
        data.add(new String[]{"F1", "F2", "F3", "F4"});
        data.add(new String[]{"Être un membre clé", "Être clé", "membre clé"});
        data.add(new String[]{"Être", "un", "membre", "clé"});
        Path path = Paths.get("test.csv");
        BufferedWriter bw = Files.newBufferedWriter(path);
        //bw.write(0xFEFF); bw.flush(); // write a BOM to the file
        CSVWriter writer = new CSVWriter(bw, ';', '"', '"', "\r\n");
        writer.writeAll(data);
        writer.flush();
        writer.close();
    }
}
Now if you open test.csv in a text editor that supports Unicode, everything will be correct. But if you simply open the same file in Excel, the accented characters are displayed incorrectly, because Excel interprets the UTF-8 bytes as ANSI.
Now we do the same, but with the line
bw.write(0xFEFF); bw.flush(); // write a BOM to the file
active (uncommented).
With the BOM at the beginning of the file, Excel detects UTF-8 and displays the French characters correctly when test.csv is simply opened.
Of course, the better approach is always to use Excel's Text Import Wizard.
See also Javascript export CSV encoding utf-8 issue for the same problem.
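The BOM idea from the example above can also be sketched with the JDK alone, which makes it easy to verify the first bytes of the file. This is only a minimal sketch under some assumptions: the class name BomCsvSketch, the helper writeWithBom and the file name test-bom.csv are hypothetical, and the quoting logic is a simplified stand-in for what opencsv's CSVWriter does. The accented characters are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class BomCsvSketch {

    // Hypothetical helper: writes the rows as quoted, semicolon-separated
    // fields in UTF-8, preceded by a BOM so that Excel can detect the encoding.
    static void writeWithBom(Path path, List<String[]> rows) throws IOException {
        // try-with-resources flushes and closes the writer even on errors
        try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            bw.write('\uFEFF'); // U+FEFF is encoded as the bytes EF BB BF in UTF-8
            for (String[] row : rows) {
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) {
                        line.append(';');
                    }
                    // Simplified quoting: wrap in quotes and double embedded quotes
                    line.append('"').append(row[i].replace("\"", "\"\"")).append('"');
                }
                bw.write(line.toString());
                bw.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("test-bom.csv"); // hypothetical file name
        writeWithBom(path, Arrays.asList(
                new String[]{"F1", "F2"},
                // "Être un membre clé" and "clé", written with Unicode escapes
                new String[]{"\u00CAtre un membre cl\u00E9", "cl\u00E9"}));

        // Show the first three bytes of the file, which should be the UTF-8 BOM
        byte[] bytes = Files.readAllBytes(path);
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
    }
}
```

Passing StandardCharsets.UTF_8 explicitly does not change the behavior of Files.newBufferedWriter, but it makes the intended encoding visible to the reader of the code.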
Upvotes: 3
Reputation: 96
Être un membre clé read as "UTF8" = ÃŠtre un membre clÃ© read as "ANSI"
Check which character encoding is used when the final file is read.
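The mismatch can be reproduced in a few lines of Java. The following is a minimal sketch, assuming that "ANSI" means the Windows-1252 code page (the actual code page depends on the Windows locale); the class name MojibakeSketch and the helper utf8ReadAsAnsi are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeSketch {

    // Hypothetical helper: encodes the text as UTF-8 and then decodes the
    // same bytes as windows-1252, which is what a reader expecting ANSI does.
    static String utf8ReadAsAnsi(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "Être un membre clé", written with Unicode escapes so the example
        // does not depend on the encoding of the source file
        String original = "\u00CAtre un membre cl\u00E9";
        // What is displayed also depends on the console encoding
        System.out.println(utf8ReadAsAnsi(original));
    }
}
```

Each accented character takes two bytes in UTF-8, and each of those bytes is decoded as a separate Windows-1252 character, which is why one character turns into two. The data in the file itself is intact; only the reader's choice of encoding is wrong.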
Upvotes: 1