LUZO

Reputation: 1029

Pyspark: Dynamically update column positions of a dataframe according to another dataframe

I have a requirement to change column positions frequently. Instead of changing the code each time, I created a temporary dataframe, Index_df. There I update the column positions, and the changes should then be reflected in the actual dataframe.

sample_df

F_cDc  F_NHY  F_XUI  F_NMY  P_cDc  P_NHY  P_XUI  P_NMY
415    258    854    245    478    278    874    235
405    197    234    456    567    188    108    267
315    458    054    375    898    978    677    134

Index_df

col    position
F_cDc  1
F_NHY  3
F_XUI  5
F_NMY  7
P_cDc  2
P_NHY  4
P_XUI  6
P_NMY  8
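For reference, a minimal snippet that reproduces both dataframes (assuming integer values, so the leading zero in 054 is dropped):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample_df: values copied from the table above
sample_df = spark.createDataFrame(
    [(415, 258, 854, 245, 478, 278, 874, 235),
     (405, 197, 234, 456, 567, 188, 108, 267),
     (315, 458, 54, 375, 898, 978, 677, 134)],
    ["F_cDc", "F_NHY", "F_XUI", "F_NMY", "P_cDc", "P_NHY", "P_XUI", "P_NMY"])

# Index_df: one row per column name, with its target position
Index_df = spark.createDataFrame(
    [("F_cDc", 1), ("F_NHY", 3), ("F_XUI", 5), ("F_NMY", 7),
     ("P_cDc", 2), ("P_NHY", 4), ("P_XUI", 6), ("P_NMY", 8)],
    ["col", "position"])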

According to Index_df, the columns of sample_df should be reordered.

Expected output:

F_cDc  P_cDc  F_NHY  P_NHY  F_XUI  P_XUI  F_NMY  P_NMY
415    478    258    278    854    874    245    235
405    567    197    188    234    108    456    267
315    898    458    978    054    677    375    134

Here the column positions are changed according to the positions I updated in Index_df.

I could do sample_df.select("<column order>"), but I have more than 70 columns, so hard-coding the order is not a good way to deal with this.

Upvotes: 0

Views: 649

Answers (1)

Steven

Reputation: 15258

You can easily achieve that with select.

First, you retrieve your columns in the right order:

NewColList = Index_df.orderBy("position").select("col").collect()
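Note that collect() returns a list of Row objects, so each element still has to be unpacked to get the plain column name:

print(NewColList[0])     # Row(col='F_cDc')
print(NewColList[0][0])  # F_cDc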

Then you apply your new order to your df:

sample_df = sample_df.select(*[i[0] for i in NewColList])
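Since Index_df has only one row per column, collecting it to the driver is cheap. You can check the result against the expected output from the question:

print(sample_df.columns)
# ['F_cDc', 'P_cDc', 'F_NHY', 'P_NHY', 'F_XUI', 'P_XUI', 'F_NMY', 'P_NMY']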

Upvotes: 6
