Reputation: 179
I am trying to save the string "the United Nations’ Sustainable Development Goals (SDGs)" into a CSV file.
If I use utf-8 as the encoding, the apostrophe in the string gets converted to an ASCII char
import csv
str_ = "the United Nations’ Sustainable Development Goals (SDGs)"
# write to a csv file
with open("output.csv", 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.writer(file, delimiter=",")
    csv_writer.writerow([str_])
# read from the csv file created above
with open("output.csv", newline='') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)
The result I get is
['the United Nationsâ€™ Sustainable Development Goals (SDGs)']
If I use cp1252 as the encoding, the apostrophe in the string is preserved as you can see in the result
import csv
str_ = "the United Nations’ Sustainable Development Goals (SDGs)"
# write to a csv file
with open("output.csv", 'w', newline='', encoding='cp1252') as file:
    csv_writer = csv.writer(file, delimiter=",")
    csv_writer.writerow([str_])
# read from the csv file created above
with open("output.csv", newline='') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)
The result I get is
['the United Nations’ Sustainable Development Goals (SDGs)']
which is ideal.
What encoding should I ideally be using if I want to preserve the special characters? Is there a benefit of using utf-8 over cp1252?
My use case is to feed lines in the CSV to a language model (GPT), and hence I want the text to stay "English", i.e. unchanged.
I am using Python 3.8 on Windows 11
Upvotes: 0
Views: 3381
Reputation: 522081
with open("output.csv", 'w', newline='', encoding='utf-8') as file:
    ...
with open("output.csv", newline='') as file:
    ...
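The problem is simply that you're explicitly, correctly writing UTF-8 to the file, but then opening it for reading without specifying an encoding. Python then falls back to a platform-dependent default, which in your case is not UTF-8 (on Windows it is typically a legacy code page such as cp1252). Thus you're decoding UTF-8 bytes with the wrong encoding and reading the file wrong.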
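To see the mismatch in isolation, here is a minimal sketch that skips the file and decodes the UTF-8 bytes directly. It assumes the implicit default on your machine is cp1252, which is common on Western-locale Windows but is not something the question states outright.

```python
# Assumption: the implicit read encoding is cp1252 (a common Windows default).
original = "Nations\u2019"  # ends with U+2019 RIGHT SINGLE QUOTATION MARK

utf8_bytes = original.encode("utf-8")   # what the writer put in the file
misread = utf8_bytes.decode("cp1252")   # what a cp1252 reader makes of it
correct = utf8_bytes.decode("utf-8")    # what a UTF-8 reader makes of it

# ascii() escapes non-ASCII characters so the output is printable on any console.
print(utf8_bytes)        # the apostrophe occupies three bytes: e2 80 99
print(ascii(misread))    # three separate characters instead of one apostrophe
print(correct == original)
```

The three bytes of the single apostrophe are each interpreted as their own cp1252 character, which is where the garbled sequence in the output comes from. Nothing was lost on disk; only the reading step is wrong.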
Also include the encoding when reading the file, and all is good:
with open('output.csv', newline='', encoding='utf-8') as file:
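Put together, a round trip with the encoding stated on both sides might look like the sketch below. The temporary directory is only an assumption made here so the example does not overwrite an existing output.csv; any path works.

```python
import csv
import os
import tempfile

text = "the United Nations\u2019 Sustainable Development Goals (SDGs)"

# Hypothetical scratch location, used only to keep the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Write with an explicit encoding ...
with open(path, "w", newline="", encoding="utf-8") as file:
    csv.writer(file).writerow([text])

# ... and read with the same explicit encoding.
with open(path, newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))

print(rows == [[text]])  # the string comes back unchanged
```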
You should use UTF-8 as the encoding, as it can encode every Unicode character. Most other encodings, cp1252 included, can only encode a small subset of them. You'd need to have a good reason to use another encoding. If you have a particular target in mind (e.g. Excel) and you know what encoding that target prefers, then use that. Otherwise UTF-8 is a sane default.
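As a rough illustration of the "subset" point, the sketch below tries to encode a few sample strings with both encodings. The sample strings and the helper name can_encode are made up for this example.

```python
# Hypothetical samples: a curly apostrophe, an accented letter,
# Japanese text, and an emoji.
samples = ["Nations\u2019", "caf\u00e9", "\u65e5\u672c\u8a9e", "\U0001F600"]


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Map each sample to (encodable as UTF-8, encodable as cp1252).
results = {s: (can_encode(s, "utf-8"), can_encode(s, "cp1252")) for s in samples}

for sample, (utf8_ok, cp1252_ok) in results.items():
    print(ascii(sample), "utf-8:", utf8_ok, "cp1252:", cp1252_ok)
```

cp1252 happens to cover the curly apostrophe from the question, which is why it appeared to work, but writing with it raises UnicodeEncodeError as soon as the text contains a character outside its 8-bit repertoire.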
Upvotes: 1