Dexter

Reputation: 91

Pyspark: function joining changeable number of columns

I wonder if there is a way to automate this. I want to write a function to which I can pass the number of columns to join. If I have a DataFrame with 3 columns and pass "number_of_columns=3", it should join columns 0, 1, 2. But if I have a DataFrame with 7 columns and pass "number_of_columns=7", it should join columns 0, 1, 2, 3, 4, 5, 6. The column names are always the same: from "0" to "number_of_columns-1".

Is there any way to do that, or do I have to write a separate function for every number of columns I want to merge?

from pyspark.sql.functions import concat_ws, col

def my_function(spark_column, name_of_column):
    # hard-coded for exactly 7 columns named "0" through "6"
    new_spark_column = spark_column.withColumn(name_of_column, concat_ws("",
                                                   col("0").cast("Integer"),
                                                   col("1").cast("Integer"),
                                                   col("2").cast("Integer"),
                                                   col("3").cast("Integer"),
                                                   col("4").cast("Integer"),
                                                   col("5").cast("Integer"),
                                                   col("6").cast("Integer")))
    return new_spark_column

Upvotes: 2

Views: 69

Answers (1)

mck

Reputation: 42392

You can use a list comprehension to do this:

from pyspark.sql.functions import concat_ws, col

def my_function(spark_column, n_cols, name_of_column):
    # take the first n_cols columns, cast each to integer, and concatenate them
    new_spark_column = spark_column.withColumn(
        name_of_column, 
        concat_ws("", *[col(c).cast("Integer") for c in spark_column.columns[:n_cols]])
    )
    return new_spark_column
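
For example, a quick sketch (the SparkSession, column values, and output column name here are made up for illustration):

df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["0", "1", "2"])
result = my_function(df, 3, "joined")
result.show()
# +---+---+---+------+
# |  0|  1|  2|joined|
# +---+---+---+------+
# |  1|  2|  3|   123|
# |  4|  5|  6|   456|
# +---+---+---+------+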

Upvotes: 1
