Dileep Unnikrishnan

Reputation: 145

Managing multiple columns with duplicate names in pyspark dataframe using spark_sanitize_names

I have a dataframe whose columns have duplicate names. The contents of these columns differ, but unfortunately the names are the same. I would like to make each column unique by appending, say, a number series to the names, like this:

foo1 |  foo2  |  laa3  |  boo4 ...
----------------------------------
     |        |        | 

Is there a way to do that? I found a utility for this in sparklyr here, but none for PySpark.

https://rdrr.io/cran/sparklyr/src/R/utils.R#sym-spark_sanitize_names

Upvotes: 0

Views: 597

Answers (1)

notNull

Reputation: 31490

We can use enumerate on df.columns to append each column's index to its name, then create a new dataframe with the new column names using toDF.

In Pyspark:

df.show()
#+---+---+---+---+
#|  i|  j|  k|  l|
#+---+---+---+---+
#|  a|  1|  v|  p|
#+---+---+---+---+

new_cols=[elm + str(index+1) for index,elm in enumerate(df.columns)]
#['i1', 'j2', 'k3', 'l4']

#creating new dataframe with new column names
df1=df.toDF(*new_cols)

df1.show()
#+---+---+---+---+
#| i1| j2| k3| l4|
#+---+---+---+---+
#|  a|  1|  v|  p|
#+---+---+---+---+

In Scala:

val new_cols=df.columns.zipWithIndex.collect{case(a,i) => a+(i+1)}

val df1=df.toDF(new_cols:_*)

df1.show()
//+---+---+---+---+
//| i1| j2| k3| l4|
//+---+---+---+---+
//|  a|  1|  v|  p|
//+---+---+---+---+
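If you want to suffix only the columns that actually collide and leave already-unique names untouched, the renaming logic itself is plain Python (a sketch with a hypothetical helper name `dedupe_columns`); apply the result with `df.toDF(*new_cols)` as above:

```python
from collections import Counter

def dedupe_columns(cols):
    # Count occurrences of each name; suffix a 1-based position
    # only onto names that appear more than once.
    counts = Counter(cols)
    return [
        name + str(i + 1) if counts[name] > 1 else name
        for i, name in enumerate(cols)
    ]

print(dedupe_columns(["foo", "foo", "laa", "boo"]))
# ['foo1', 'foo2', 'laa', 'boo']
```

Note this uses the column's overall position as the suffix, matching the numbering style in the answer; you could instead number each duplicate group independently if you prefer `foo1, foo2` regardless of position.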

Upvotes: 1
