Hitesh Ghuge

Characters get converted into special characters

I am using Apache POI to read a .docx file and, after some operations, write the data to a .csv file. The .docx file is in French, but when I write the data to the .csv, some of the French characters get converted into special characters; for example Être un membre clé comes out as ÃŠtre un membre clÃ©.

The code below is used to write the file:

        Path path = Paths.get(filePath);
        BufferedWriter bw = Files.newBufferedWriter(path);
        CSVWriter writer = new CSVWriter(bw);
        writer.writeAll(data);

which uses UTF-8 by default.
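
For reference, Files.newBufferedWriter(path) without a Charset argument already uses StandardCharsets.UTF_8, so the snippet above is equivalent to spelling the charset out explicitly. A minimal sketch with the same filePath and data (requires java.nio.charset.StandardCharsets):

        Path path = Paths.get(filePath);
        // explicit UTF-8; identical to the parameterless overload used above
        BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        CSVWriter writer = new CSVWriter(bw);
        writer.writeAll(data);
        writer.flush();
        writer.close();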

While debugging I checked that the data is still correct right before it is written to the .csv, but it appears converted in the output file. I have set the default locale to Locale.FRENCH.

Have I missed something?

Answers (2)

Axel Richter

I suspect it is Excel that reads the UTF-8 encoded CSV as ANSI. This happens when you simply open the CSV in Excel without using the Text Import Wizard: Excel always assumes ANSI if there is no BOM at the beginning of the file. If you open the CSV in a text editor that supports Unicode, everything will be correct.

Example:

import java.io.BufferedWriter;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;

import java.util.Locale;
import java.util.List;
import java.util.ArrayList;

import com.opencsv.CSVWriter;

class DocxToCSV {

 public static void main(String[] args) throws Exception {

  Locale.setDefault(Locale.FRENCH);

  List<String[]> data = new ArrayList<String[]>();
  data.add(new String[]{"F1", "F2", "F3", "F4"});
  data.add(new String[]{"Être un membre clé", "Être clé", "membre clé"});
  data.add(new String[]{"Être", "un", "membre", "clé"});

  Path path = Paths.get("test.csv");
  BufferedWriter bw = Files.newBufferedWriter(path);

  //bw.write(0xFEFF); bw.flush(); // write a BOM to the file

  CSVWriter writer = new CSVWriter(bw, ';', '"', '"', "\r\n"); // separator ';', quote char '"', escape char '"', CRLF line ending
  writer.writeAll(data);
  writer.flush();
  writer.close();

 }
}

Now if you open test.csv in a text editor that supports Unicode, everything is correct. But if you open the same file in Excel, it looks like this:

(screenshot: the French characters appear garbled in Excel)

Now we do the same but having

bw.write(0xFEFF); bw.flush(); // write a BOM to the file

active.

This results in the following when test.csv is simply opened by Excel:

(screenshot: the French characters display correctly in Excel)

Of course the better approach is always using Excel's Text Import Wizard.
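
If you want to verify that the BOM really ended up in the file, you can look at its first three bytes. A small sketch (the class name BomCheck is only for illustration; test.csv is the file from the example above):

import java.nio.file.Files;
import java.nio.file.Paths;

class BomCheck {

 public static void main(String[] args) throws Exception {

  byte[] head = Files.readAllBytes(Paths.get("test.csv"));

  // the UTF-8 BOM is the byte sequence EF BB BF at the very start of the file
  boolean hasBom = head.length >= 3
   && (head[0] & 0xFF) == 0xEF
   && (head[1] & 0xFF) == 0xBB
   && (head[2] & 0xFF) == 0xBF;

  System.out.println("UTF-8 BOM present: " + hasBom);

 }
}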

See also Javascript export CSV encoding utf-8 issue for the same problem.

cj rogers

Être un membre clé encoded as UTF-8 reads as ÃŠtre un membre clÃ© when decoded as ANSI.

Check the character encoding you are using to read the final file.
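
A quick way to see this effect is to decode the same UTF-8 bytes with both charsets. A minimal sketch (the class name is only for illustration):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {

 public static void main(String[] args) {

  byte[] utf8Bytes = "Être un membre clé".getBytes(StandardCharsets.UTF_8);

  // decoded with the encoding it was written in
  System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));          // Être un membre clé

  // decoded as ANSI (windows-1252), which Excel assumes when there is no BOM
  System.out.println(new String(utf8Bytes, Charset.forName("windows-1252"))); // ÃŠtre un membre clÃ©

 }
}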
