Reputation: 427
I have a pandas dataframe, which I am creating from an excel (.xlsx) file. Using df = pd.read_excel()
One of the column names in the excel document is: IN - Original ex factory date
.
I am trying to remove all special characters and white spaces as I will be saving this file in Avro format in Spark.
I am doing the following to clean the column names:
df.columns = df.columns.str.replace('(', '')
df.columns = df.columns.str.replace(')', '')
df.columns = df.columns.str.replace('=', '')
df.columns = df.columns.str.replace('-', '_')
df.columns = df.columns.str.replace('/', '_')
df.columns = df.columns.str.replace('Â', '__')
df.columns = df.columns.str.replace(' ', '_')
After the above transformations, I am left with the above column name as:
IN___Original__ ex_factory_date
It doesn't remove the whitespace before ex_factory_date
. And you can see all the other whitespaces are replaced by _
.
I even tried to use unicodedata
def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
only_ascii = nfkd_form.encode('ASCII', 'ignore')
return only_ascii
for col in df.columns:
print(col)
neutral = remove_accents(col)
print(normal)
It is returning the same thing: IN___Original__ ex_factory_date
Is there something I am missing as to why this wihtespace is not removed? I need a clean column name.
Upvotes: 0
Views: 280
Reputation: 1888
Most likely, your space is not the ASCII space character, but rather one of the many Unicode whitespace characters (see this extensive answer https://stackoverflow.com/a/37903645/5660315). This code should replace your whitespace:
# .str.replace supports regular expressions by default
df.columns = df.columns.str.replace('\s', '_')
Upvotes: 1