Remove non-ASCII characters from DataFrame column headers

Question

I have exported a comma-separated values file from a MSQL database (rpt file ending). It has only two columns and 8 rows. Looking at the file in Notepad, everything looks OK. I tried to load the data into a pandas DataFrame using the code below:

import pandas as pd
with open('file.csv', 'r') as csvfile:
    df_data = pd.read_csv(csvfile, sep=',', encoding='utf-8')
print(df_data)

When printing to the console, the first column header is wrong: it has some extra characters, ï»¿, at the start of the name. I get no errors, but obviously the first column is decoded wrongly by my code.

Anyone have any ideas on how to get this right?

cs95 · Accepted Answer

Here's one possible option: Fix those headers after loading them in:

df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]

The str.encode followed by the str.decode call will drop those special characters, leaving only the ones in ASCII range behind:

>>> 'ï»¿aSA'.encode('utf-8').decode('ascii', 'ignore')
'aSA'
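
To see the same idea applied to a whole DataFrame, here is a minimal sketch. The frame, its column names ('ï»¿id', 'value') and the helper strip_non_ascii are made up for illustration and are not from the question's file:

```python
import pandas as pd

# Hypothetical two-column frame; the first header carries the stray characters.
df = pd.DataFrame([[1, 2]], columns=['ï»¿id', 'value'])

def strip_non_ascii(name):
    # Encode to UTF-8 bytes, then decode as ASCII while silently
    # dropping every byte outside the ASCII range.
    return name.encode('utf-8').decode('ascii', 'ignore')

df.columns = [strip_non_ascii(x) for x in df.columns]
cleaned = list(df.columns)
print(cleaned)
```

Keep in mind that this drops every non-ASCII character, not just the unwanted prefix, so a header such as 'café' would come out as 'caf'.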
