Reputation: 1095
I'm using MS SQL Server Management Studio and I have a simple table with the following data:
CountryId CommonName FormalName
--------- ---------- ----------
1 Afghanistan Islamic State of Afghanistan
2 Albania Republic of Albania
3 Algeria People's Democratic Republic of Algeria
4 Andorra Principality of Andorra
I use "Save Results As" to save this data into countries.csv using the default UTF-8 encoding. Then I go into IPython and read it into a data frame using pandas:
df = pd.read_csv("countries.csv")
If I do
df.columns
I get:
Index([u'CountryId', u'CommonName', u'FormalName'], dtype='object')
The weird thing is that when I copy the column names, paste them into a new cell, and press Enter, I get:
u'\ufeffCountryId', u'CommonName', u'FormalName'
A Unicode character, \ufeff, shows up at the beginning of the first column name.
I tried the procedure with different tables and got the extra character every time. It only ever affects the first column name.
Can anyone explain to me why the extra unicode character showed up?
Upvotes: 0
Views: 847
Reputation: 39621
Try using the encoding = "utf-8-sig" option with read_csv. For example:
df = pd.read_csv("countries.csv", encoding = "utf-8-sig")
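That should get it to ignore the Unicode byte order mark (BOM) at the start of the CSV file. With the plain utf-8 codec, the BOM is decoded as the character U+FEFF and ends up attached to the first column name, which is why only that name is affected. The BOM is unnecessary here, since UTF-8 has a fixed byte order, but Microsoft tools like to use it as a magic number to identify UTF-8 encoded text files.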
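To see the difference between the two codecs without pandas, here is a minimal sketch using only the standard library. The byte string raw and the helper first_header are made up for illustration; raw imitates what a BOM-writing tool would put at the start of countries.csv. It is written for Python 3, so strings appear without the u prefix shown in the question.

```python
import codecs

# Hypothetical file contents: the three-byte UTF-8 BOM followed by CSV text.
raw = codecs.BOM_UTF8 + (
    "CountryId,CommonName,FormalName\n"
    "1,Afghanistan,Islamic State of Afghanistan\n"
).encode("utf-8")


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes and return the first column name of the header row."""
    text = data.decode(encoding)
    return text.splitlines()[0].split(",")[0]


# Plain utf-8 keeps the BOM as the character U+FEFF in the decoded text.
plain = first_header(raw, "utf-8")

# utf-8-sig consumes a leading BOM if one is present.
stripped = first_header(raw, "utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

The first name comes back with a leading U+FEFF and the second without it. "utf-8-sig" also decodes files that have no BOM, so it should be safe to use when you are not sure which kind of file you have.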
Upvotes: 3