Reputation: 988
I am loading a CSV into a PySpark DataFrame. I am trying to remove spaces and special characters like "(", ")" and "/" from the column headers.
I could remove the spaces from the column headers like below:
for col in df.columns:
    df = df.withColumnRenamed(col, col.replace(" ", ""))
But this doesn't fully work: it removes only the spaces from the column names, not the special characters.
I tried the below and it works:
for col in df.columns:
    df = df.withColumnRenamed(col, col.replace(" ", "").replace("(", "").replace(")", "").replace("/", ""))
Is there a more elegant way of removing these characters? Thanks.
Upvotes: 0
Views: 386
Reputation: 8711
You can use the re module with alternation. Check this:
import re
columns = ["Column 1", "column(2)", "column /3 "]
newcols = [re.sub(r'\(|\)|/|\s+', "", col) for col in columns]
newcols
['Column1', 'column2', 'column3']
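As a side note, the same pattern can be written with a single character class, which reads a little more compactly (plain re, nothing Spark-specific):
# [()/\s] matches any one of "(", ")", "/" or a whitespace character
newcols = [re.sub(r'[()/\s]', '', col) for col in columns]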
With Spark:
data = [[201912, 12, "AA"], [201912, 11, "BB"]]
columns = ["Column 1", "column(2)", "column /3 "]
df = spark.createDataFrame(data, columns)
df.show()
+--------+---------+----------+
|Column 1|column(2)|column /3 |
+--------+---------+----------+
| 201912| 12| AA|
| 201912| 11| BB|
+--------+---------+----------+
import re
columns = ["Column 1", "column(2)", "column /3 "]
newcols = [re.sub(r'\(|\)|/|\s+', "", col) for col in columns]
dfcols = dict(zip(columns, newcols))
dfcols
{'Column 1': 'Column1', 'column(2)': 'column2', 'column /3 ': 'column3'}
for old, new in dfcols.items():
    df = df.withColumnRenamed(old, new)
df.printSchema()
root
|-- Column1: long (nullable = true)
|-- column2: long (nullable = true)
|-- column3: string (nullable = true)
df.select(*dfcols.values()).show()
+-------+-------+-------+
|Column1|column2|column3|
+-------+-------+-------+
| 201912| 12| AA|
| 201912| 11| BB|
+-------+-------+-------+
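Since newcols is already in the original column order, the rename loop can also be collapsed into a single toDF call (assuming the cleaned names are unique):
# toDF takes the full list of new column names, in order
df = df.toDF(*newcols)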
Upvotes: 1
Reputation: 42352
Try the code below:
to_replace = [" ", "(", ")", "/"]
for col in df.columns:
    col2 = col
    for s in to_replace:
        col2 = col2.replace(s, "")
    df = df.withColumnRenamed(col, col2)
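For what it's worth, the chained replaces can also be collapsed with str.translate; a minimal sketch of the same rename:
# str.maketrans('', '', ' ()/') builds a table that deletes the listed characters
table = str.maketrans('', '', ' ()/')
for col in df.columns:
    df = df.withColumnRenamed(col, col.translate(table))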
Upvotes: 1