Reputation: 13091
I have a DataFrame of 10 columns and want to write a function that concatenates columns based on an array of column names that comes in as input:
arr = ["col1", "col2", "col3"]
This is current so far:
newDF = rawDF.select(concat(col("col1"), col("col2"), col("col3"))) \
    .exceptAll(updateDF.select(concat(col("col1"), col("col2"), col("col3"))))
Also:
df3 = df2.join(df1, concat(df2.col1, df2.col2, df2.col3) == df1.col5)
But I want to build this with a loop or a function driven by the input array, instead of hard-coding the columns as it is now. What is the best way to do that?
Upvotes: 1
Views: 213
Reputation: 8410
You can unpack the columns using the * operator. In the pyspark.sql docs, whenever a function's signature shows (*cols), it means you can unpack a sequence of columns into it. For concat the signature is:
pyspark.sql.functions.concat(*cols)
from pyspark.sql import functions as F
arr = ["col1", "col2", "col3"]
newDF = rawDF.select(F.concat(*(F.col(c) for c in arr))).exceptAll(
    updateDF.select(F.concat(*(F.col(c) for c in arr)))
)
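For a quick sanity check, here is a minimal runnable sketch; the sample rows standing in for rawDF and updateDF are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
arr = ["col1", "col2", "col3"]

# Hypothetical sample data standing in for rawDF and updateDF
rawDF = spark.createDataFrame([("a", "b", "c"), ("x", "y", "z")], arr)
updateDF = spark.createDataFrame([("a", "b", "c")], arr)

# Unpack the generated Column objects into concat on both sides
newDF = rawDF.select(F.concat(*(F.col(c) for c in arr))).exceptAll(
    updateDF.select(F.concat(*(F.col(c) for c in arr)))
)
newDF.show()  # only the concatenated "xyz" row remains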
For joins:
arr = ['col1', 'col2', 'col3']
df3 = df2.join(df1, F.concat(*(F.col(c) for c in arr)) == df1.col5)
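And a minimal join sketch under the same assumptions (here df1.col5 is assumed to already hold the concatenated key, as the question implies):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
arr = ['col1', 'col2', 'col3']

# Hypothetical data: df1.col5 holds the pre-concatenated key
df2 = spark.createDataFrame([("a", "b", "c")], arr)
df1 = spark.createDataFrame([("abc", 100)], ["col5", "value"])

# col1/col2/col3 only exist in df2, so F.col resolves them unambiguously
df3 = df2.join(df1, F.concat(*(F.col(c) for c in arr)) == df1.col5)
df3.show()  # the ("a", "b", "c") row joins to the "abc" key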
Upvotes: 1