Pandas dataframe column name not formatting as expected

Question

I have a pandas dataframe, which I am creating from an excel (.xlsx) file. Using df = pd.read_excel()

One of the column names in the excel document is: IN - OriginalÂ ex factory date.

I am trying to remove all special characters and white spaces as I will be saving this file in Avro format in Spark.

I am doing the following to clean the column names:

df.columns = df.columns.str.replace('(', '')
df.columns = df.columns.str.replace(')', '')
df.columns = df.columns.str.replace('=', '')
df.columns = df.columns.str.replace('-', '_')
df.columns = df.columns.str.replace('/', '_')
df.columns = df.columns.str.replace('Â', '__')
df.columns = df.columns.str.replace(' ', '_')

After the above transformations, I am left with the above column name as:

IN___Original__ ex_factory_date

It doesn't remove the whitespace before ex_factory_date. And you can see all the other whitespaces are replaced by _.

I even tried to use unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

for col in df.columns:
    print(col)
    neutral = remove_accents(col)
    print(normal)

It is returning the same thing: IN___Original__ ex_factory_date

Is there something I am missing as to why this wihtespace is not removed? I need a clean column name.

VirtualScooter · Accepted Answer

Most likely, your space is not the ASCII space character, but rather one of the many Unicode whitespace characters (see this extensive answer https://stackoverflow.com/a/37903645/5660315). This code should replace your whitespace:

# .str.replace supports regular expressions by default
df.columns = df.columns.str.replace('\s', '_')

Pandas dataframe column name not formatting as expected

Answers (1)

Related Questions