Reputation: 145
I have a dataframe with duplicate column names. The contents of these columns are different, but unfortunately the names are the same. I would like to make each column name unique by appending, say, a number series, like this:
foo1 | foo2 | laa3 | boo4 | ...
-----+------+------+------+----
     |      |      |      |
Is there a way to do that? I found a utility for sparklyr here, but none for PySpark.
https://rdrr.io/cran/sparklyr/src/R/utils.R#sym-spark_sanitize_names
Upvotes: 0
Views: 597
Reputation: 31490
We can use enumerate on df.columns and append each index value (plus one) to the corresponding column name.
In PySpark:
df.show()
#+---+---+---+---+
#| i| j| k| l|
#+---+---+---+---+
#| a| 1| v| p|
#+---+---+---+---+
new_cols=[elm + str(index+1) for index,elm in enumerate(df.columns)]
#['i1', 'j2', 'k3', 'l4']
#creating new dataframe with new column names
df1=df.toDF(*new_cols)
df1.show()
#+---+---+---+---+
#| i1| j2| k3| l4|
#+---+---+---+---+
#| a| 1| v| p|
#+---+---+---+---+
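If you only want to rename the columns that are actually duplicated and leave unique names untouched, here is a plain-Python sketch of a variant (the `counts`/`seen` helpers are my own names, not part of the answer above); it works on the list of column names, which you can then apply with `df.toDF(*new_cols)`:

```python
from collections import Counter

# Example column names with duplicates, as in the question.
cols = ["foo", "foo", "laa", "boo"]

# The answer's approach: number every column by its 1-based position.
numbered = [c + str(i + 1) for i, c in enumerate(cols)]
# ['foo1', 'foo2', 'laa3', 'boo4']

# Variant: suffix only names that occur more than once.
counts = Counter(cols)   # total occurrences of each name
seen = Counter()         # occurrences encountered so far
deduped = []
for c in cols:
    if counts[c] > 1:
        seen[c] += 1
        deduped.append(c + str(seen[c]))
    else:
        deduped.append(c)
# ['foo1', 'foo2', 'laa', 'boo']
```

Either list can then be passed to `toDF`, e.g. `df1 = df.toDF(*deduped)`.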
In Scala:
val new_cols=df.columns.zipWithIndex.collect{case(a,i) => a+(i+1)}
val df1=df.toDF(new_cols:_*)
df1.show()
//+---+---+---+---+
//| i1| j2| k3| l4|
//+---+---+---+---+
//| a| 1| v| p|
//+---+---+---+---+
Upvotes: 1