Reputation: 431

Character encoding in csv

We have a requirement to pick data from an Oracle DB table and dump it into a CSV file and a plain pipe-separated text file, then give the user a link in the application so they can view the generated CSV/text files.

As a lot of parsing was involved, we wrote a Unix shell script and call it from our Struts/J2EE application.

Earlier we were losing the Chinese and Roman characters in the generated files, and the generated files had the us-ascii charset (checked using file -i). Later we set NLS_LANG=AMERICAN_AMERICA.AL32UTF8, and this gave us UTF-8 files.

But the characters were still gibberish, so we tried the iconv command and converted the UTF-8 files to the UTF-16LE charset:

iconv -f utf-8 -t utf-16le $recordFile > $tempFile

This works fine for the generated text file, but in the CSV the Chinese and Roman characters are still not correct. If we open this CSV file in Notepad, add a newline by pressing Enter, save it, and then open it with MS Excel, all characters display correctly, including the Chinese and Roman ones. However, each row now appears as a single line of text instead of being split into columns.

Not sure what's going on.

Java code

PrintWriter out = servletResponse.getWriter(); 
servletResponse.setContentType("application/vnd.ms-excel; charset=UTF-8");
servletResponse.setCharacterEncoding("UTF-8");
servletResponse.setHeader("Content-Disposition","attachment; filename="+ fileName.toString());                   
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);  
int i;   
while ((i=fileInputStream.read()) != -1) {  
 out.write(i);   
} 
fileInputStream.close();   
out.close();

Please let me know if I missed any details. Thanks to all for taking the time to go through this.

Upvotes: 0

Answers (2)

pranav

Reputation: 431

I was able to solve it. First, as mentioned by Aaron, I removed the UTF-16LE conversion to avoid future issues and kept the files encoded as UTF-8. I then changed the PrintWriter in the Java code to an OutputStream and was able to see the correct characters in my text file.
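One plausible reason the switch to OutputStream helps is sketched below. It assumes the response writer encodes characters as UTF-8; under that assumption, passing each raw file byte to Writer.write(int) treats the byte as a character and encodes it a second time. The class and method names are hypothetical, and a ByteArrayOutputStream stands in for the servlet response.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original application.
class WriterVersusStream {

    // Mimics the question's loop: every raw byte goes through Writer.write(int),
    // which interprets it as a char and encodes that char as UTF-8 again.
    static byte[] copyThroughWriter(byte[] utf8Bytes) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            for (byte b : utf8Bytes) {
                writer.write(b & 0xFF);
            }
        }
        return sink.toByteArray();
    }

    // Byte-for-byte copy, which is what an OutputStream does.
    static byte[] copyThroughStream(byte[] utf8Bytes) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        for (byte b : utf8Bytes) {
            sink.write(b & 0xFF);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "\u00f6".getBytes(StandardCharsets.UTF_8); // bytes C3 B6
        System.out.println(original.length);                    // 2
        System.out.println(copyThroughWriter(original).length); // 4 (C3 83 C2 B6)
        System.out.println(copyThroughStream(original).length); // 2
    }
}
```

If the writer in the real application used a different encoding, the exact corruption would differ, but the general point stands: bytes that are already encoded should be copied with a stream, not a writer.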

The CSV was still showing garbage. I learned that the bytes EF BB BF (the UTF-8 BOM) must be prepended to the file, because BOM-aware software like MS Excel needs them to recognize the encoding. Changing the Java code as below did the trick for the CSV.

OutputStream out = servletResponse.getOutputStream();
// Prepend the UTF-8 BOM (EF BB BF) so that Excel detects the encoding
out.write(239); // 0xEF
out.write(187); // 0xBB
out.write(191); // 0xBF
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
// Copy the file byte for byte; no character conversion takes place
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.flush();
out.close();
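The same idea can be sketched as a small helper that writes the BOM and then copies the file in chunks rather than one byte at a time. BomCopy and copyWithBom are hypothetical names, and the helper assumes the input is already valid UTF-8 without a BOM of its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; not part of the original application.
class BomCopy {

    // The three bytes that form the UTF-8 byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Writes the BOM, then copies the input unchanged using a buffer.
    // Assumes the input does not already start with a BOM.
    static void copyWithBom(InputStream in, OutputStream out) throws IOException {
        out.write(UTF8_BOM);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }
}
```

In the servlet this could be called as copyWithBom(fileInputStream, servletResponse.getOutputStream()), ideally with the FileInputStream opened in a try-with-resources block so it is closed even if the copy fails.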

Upvotes: 3

Aaron Digulla

Reputation: 328594

As always with Unicode problems, every single step of the transformation chain must work perfectly. If you make a mistake in one place, data will be silently corrupted. There is no easy way to figure out where it happens; you have to debug the code or write unit tests.

The Java code above only works if the file actually contains UTF-8 encoded data; it doesn't "magically" figure out what's in the file and convert it to UTF-8. So if the file already contains garbage, you just slap a "this is UTF-8" label on it, but it's still garbage.
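A minimal sketch of that effect, using ISO-8859-1 as an assumed stand-in for "whatever the file really contains": the bytes are decoded as UTF-8 only because of the label, and the text does not survive. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original application.
class MislabelledBytes {

    // Decodes the bytes as UTF-8 regardless of how they were actually encoded,
    // which is what a "charset=UTF-8" label asks the receiver to do.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "abc \u00f6\u00e4\u00fc";
        // Assumption for the demo: the file was really written as ISO-8859-1.
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        String decoded = decodeAsUtf8(latin1);
        System.out.println("equal to original: " + original.equals(decoded));
        System.out.println("contains U+FFFD:   " + decoded.contains("\uFFFD"));
    }
}
```

The ASCII part "abc " comes through intact, which is why this kind of corruption often goes unnoticed until non-ASCII data shows up.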

For you, that means creating test cases which take known test data and move it through every step of the chain: inserting into the database, reading from the database, writing to the CSV, writing to the text file, reading those files, and downloading them to the user.

For each step, you need to write unit tests which take a known Unicode string like abc öäü, process it, and then check the result. To make it easier to input in Java code, use "abc \u00f6\u00e4\u00fc". You may also want to add spaces at the beginning and end of the string to see whether they are properly preserved.
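As an illustration, a check for just one step of the chain (writing a file as UTF-8 and reading it back) might look like the sketch below. It uses plain Java rather than a test framework, the class and method names are hypothetical, and a temporary file stands in for the real CSV output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical check for a single step; not part of the original application.
class RoundTripCheck {

    // Known sample with leading and trailing spaces and non-ASCII characters.
    static final String SAMPLE = " abc \u00f6\u00e4\u00fc ";

    // Writes the text to a temporary file as UTF-8 and reads it back as UTF-8.
    static String writeAndReadBack(String text) throws IOException {
        Path tmp = Files.createTempFile("roundtrip", ".csv");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        String result = writeAndReadBack(SAMPLE);
        if (!SAMPLE.equals(result)) {
            throw new AssertionError("round trip changed the text: [" + result + "]");
        }
        System.out.println("round trip preserved the sample string");
    }
}
```

The other steps (database insert and read, the shell script, the download) would each need an equivalent check with the same sample string, so that a failure points at one specific step.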

file -i doesn't help you much here, since it only guesses at what the file contains. There is no indicator (data or metadata) in a text file which says "this is UTF-8". UTF-16 supports a BOM header for this, but almost no one uses UTF-16, so many tools don't support it (properly).

Upvotes: 1
