Lilly
Lilly

Reputation: 988

pyspark: dataframe header transformation

I am loading a csv into pyspark dataframe. I am trying to remove spaces and more special characters like "(", ")" and "/" from the column headers.

I could remove spaces from the column headers like below.

for col in df.columns:
  df = df.withColumnRenamed(col,col.replace(" ", "").replace("(", "").replace(")", "").replace("/", "")) 
   

But this doesnt work. It removes only spaces in the columns but not the special characters.

I tried as below and it works

for col in df.columns:
  df = df.withColumnRenamed(col,col.replace(" ", "").replace("(", "").replace(")", "").replace("/", "")) 
  

Is there an elegant way of removing? Thanks.

Upvotes: 0

Views: 386

Answers (2)

stack0114106
stack0114106

Reputation: 8711

You can use re module with alternatives. Check this.

import re
columns=["Column 1","column(2)","column /3 "]
newcols = [ re.sub('\(|\)|/|\s+',"",col) for col in columns]
newcols

['Column1', 'column2', 'column3']

with spark,

data=[[201912,12,"AA"],[201912,11,"BB"]]
columns=["Column 1","column(2)","column /3 "]
df=spark.createDataFrame(data,columns)

df.show()
+--------+---------+----------+
|Column 1|column(2)|column /3 |
+--------+---------+----------+
|  201912|       12|        AA|
|  201912|       11|        BB|
+--------+---------+----------+

import re
columns=["Column 1","column(2)","column /3 "]
newcols = [ re.sub('\(|\)|/|\s+',"",col) for col in columns]
dfcols = dict(zip(columns,newcols))
dfcols

{'Column 1': 'Column1', 'column(2)': 'column2', 'column /3 ': 'column3'}

for old, new in dfcols.items():
    df=df.withColumnRenamed(old,new)
df.printSchema()
root
 |-- Column1: long (nullable = true)
 |-- column2: long (nullable = true)
 |-- column3: string (nullable = true)

df.select(*dfcols.values()).show()
+-------+-------+-------+
|Column1|column2|column3|
+-------+-------+-------+
| 201912|     12|     AA|
| 201912|     11|     BB|
+-------+-------+-------+

Upvotes: 1

mck
mck

Reputation: 42352

Try below code:

to_replace = [" ", "(", ")", "/"]

for col in df.columns:
    col2 = col
    for s in to_replace:
        col2 = col2.replace(s, "")
    df = df.withColumnRenamed(col, col2)

Upvotes: 1

Related Questions