Reputation: 591
I have an xlsx file that I need to convert to csv, I used openpyxl module along with unicodecsv for this. My problem is that while writing some files I am getting some junk characters in output. Details below
One of my file has unicode code point u'\xa0' in it which corresponds to NON BREAK SPACE, but when converted to csv my file shows the  instead of space. While printing the same data on console using Python GUI it prints perfectly without any Â. What am I doing wrong here? any help is appreciated.
Sample Code:
import unicodecsv
from openpyxl import load_workbook
xlsx_file=load_workbook('testfile.xlsx',use_iterators=True)
with open('test_utf.csv','wb') as open_file:
csv_file=unicodecsv.writer(open_file)
sheet=xls_file.get_active_sheet()
for row in sheet.iter_rows():
csv_file.writerow(cell.internal_value for cell in row)
P.S: The type of data written is Unicode.
Upvotes: 3
Views: 735
Reputation: 2914
Okay, so what is going on is that Excel likes to assume that you are using the currently configured codepage. You have a couple of options:
Write your data in that codepage. This requires however that you know which one your users will be using.
Load the csv file using the "import data" menu option. If you are relying on your users to do this, don't. Most people will not be willing to do this.
Use a different program that will accept unicode in csv by default, such as Libre Office.
Add a BOM to the beginning of the file to get Excel to recognise utf-8. This may break in other programs.
Since this is for your personal use, if you are only ever going to use Excel, then appending a byte order marker to the beginning is probably the easiest solution.
Upvotes: 2
Reputation: 177554
Microsoft likes byte-order marks in its text files. Even though a BOM doesn't make sense with UTF-8, it is used as a signature to let Excel know the file is encoded in UTF-8.
Make sure to generate your .csv as UTF-8 with BOM. I created the following using Notepad++:
English,Chinese
American,美国人
Chinese,中国人
The result saved with BOM:
The result without BOM:
Upvotes: 1