MS SQL Server Management Studio export to CSV introduces extra character when reading from pandas

Question

I'm using MS SQL Server Management Studio and I have a simple table with the following data:

CountryId     CommonName  FormalName
---------     ----------  ----------
        1    Afghanistan  Islamic State of Afghanistan
        2        Albania  Republic of Albania
        3        Algeria  People's Democratic Republic of Algeria
        4        Andorra  Principality of Andorra

I use "Save Results As" to save this data into countries.csv using the default UTF-8 encoding. Then I go into IPython and read it into a data frame using pandas:

df = pd.read_csv("countries.csv")

If I do

df.columns

I get:

Index([u'CountryId', u'CommonName', u'FormalName'], dtype='object')

The weird thing is that when I copy the column names, paste them into a new cell, and press Enter, I get:

u'\ufeffCountryId', u'CommonName', u'FormalName'

A Unicode character, \ufeff, shows up at the beginning of the first column name.

I tried the procedure with different tables and got the extra character every time. It only ever affects the first column name.

Can anyone explain why the extra Unicode character shows up?

Ross Ridge · Accepted Answer

Try passing encoding="utf-8-sig" to read_csv. For example:

df = pd.read_csv("countries.csv", encoding = "utf-8-sig")

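That should get it to ignore the Unicode byte order mark (BOM) at the start of the CSV file. With the plain "utf-8" codec, the BOM is decoded as the character U+FEFF and ends up as part of the first column name, which is why only the first column is affected. The BOM is unnecessary here, since UTF-8 has no byte-order ambiguity, but Microsoft tools like to write it as a magic number to identify UTF-8 encoded text files.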
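
To see the difference between the two codecs without needing SSMS or pandas, here is a minimal sketch using only the standard library. The bytes in raw are a hypothetical stand-in for what the export writes (the UTF-8 BOM followed by the CSV text), and header is just an illustrative helper, not part of any library:

```python
import codecs
import csv
import io

# Hypothetical stand-in for the exported file: UTF-8 BOM + CSV text.
raw = codecs.BOM_UTF8 + (
    "CountryId,CommonName,FormalName\n"
    "1,Afghanistan,Islamic State of Afghanistan\n"
).encode("utf-8")


def header(data, encoding):
    """Decode the bytes with the given codec and return the CSV header row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


plain = header(raw, "utf-8")      # BOM survives as U+FEFF on the first name
clean = header(raw, "utf-8-sig")  # BOM is stripped while decoding

print(repr(plain[0]))  # '\ufeffCountryId'
print(repr(clean[0]))  # 'CountryId'
```

The "utf-8-sig" codec also reads files that have no BOM, so it should be safe to use even if you are not sure whether a given file starts with one.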
