Marc

Reputation: 123

Remove Period (.) from Dataframe Column Names

So I've gone through all the examples on here of replacing special characters from column names, but I can't seem to get it to work for periods.

What I've tried:

import re
from pyspark.sql import functions as F

# works to remove spaces
df.select([F.col(c).alias(c.replace(' ', '_')) for c in df.columns])

# doesn't work to remove periods
df.select([F.col(c).alias(c.replace('.', '')) for c in df.columns])

# removes special characters except periods
df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])

I know how to rename a specific column by referencing it directly, but this needs to rename the columns of any dataframe whose column names contain special characters.

Specifically here is the column name that is giving me trouble: "Src. of Business Contact Full Name"
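
For reference, here's a minimal sketch of what the failure looks like with a shortened column name (assuming an existing SparkSession named spark):

from pyspark.sql import functions as F

# the dot makes Spark treat "Src" as a struct column and " of Business" as a
# nested field, so column resolution fails with an AnalysisException
df = spark.createDataFrame([('x',)], ['Src. of Business'])
df.select([F.col(c).alias(c.replace('.', '')) for c in df.columns]).show()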

Upvotes: 2

Views: 2826

Answers (3)

Rob Raymond

Reputation: 31226

If you can work in pandas rather than Spark, why not keep it as simple as below?

import re
import pandas as pd

df = pd.DataFrame(["a biz"], columns=["Src.$ of-Business Contact` Full Name"])
# strip . ` $ entirely, then turn spaces and hyphens into underscores
df.columns = [re.sub(r"[ -]", "_", re.sub(r"[.`$]", "", c)) for c in df.columns]
df

output

    Src_of_Business_Contact_Full_Name
0   a biz
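
If you need to stay in Spark rather than pandas, the same regex cleanup should also work via toDF, which renames every column positionally (a sketch, assuming df is instead your original Spark dataframe):

import re

# build the cleaned names first, then rename all columns in one pass
clean = [re.sub(r"[ -]", "_", re.sub(r"[.`$]", "", c)) for c in df.columns]
df = df.toDF(*clean)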

Upvotes: 1

murtihash

Reputation: 8410

Another way to go about this is to use reduce with withColumnRenamed.

from functools import reduce

renamed_df = reduce(lambda new_df, col: new_df.withColumnRenamed(col, col.replace('.', '')), df.columns, df)
renamed_df.show()
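
The same reduce pattern can be stretched to cover any special character, not just the period; the regex below is only an illustration:

import re

# replace spaces with underscores, then drop anything that is not alphanumeric or underscore
clean_df = reduce(
    lambda acc, c: acc.withColumnRenamed(c, re.sub(r"[^0-9a-zA-Z_]+", "", c.replace(" ", "_"))),
    df.columns,
    df,
)
clean_df.show()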

Upvotes: 0

notNull

Reputation: 31510

Try escaping the column names with backticks: `col_name`.

from pyspark.sql import functions as F

df = spark.createDataFrame([('1', '2')], ['header', 'pla.nned'])
df.columns
#['header', 'pla.nned']

df.select([F.col("`{0}`".format(c)).alias(c.replace('.', '')) for c in df.columns]).show()
#+------+-------+
#|header|planned|
#+------+-------+
#|     1|      2|
#+------+-------+
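
The backtick escaping also combines with a regex if you want to strip every special character rather than only the period (a sketch built on the same example df):

import re

# backticks stop Spark from parsing the dot as struct access; the regex keeps only alphanumerics
df.select([F.col("`{0}`".format(c)).alias(re.sub("[^0-9a-zA-Z]+", "", c)) for c in df.columns]).show()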

Upvotes: 1
