Leo

Reputation: 908

Pyspark addition across columns

I have a dataframe with 100+ columns. I need to create a derived field that is the sum of some of these columns, based on a condition.

For example, NEW_COLUMN_VALUE should be the sum of A_2, A_3 and A_4:

from pyspark.sql.functions import col, lit, when

df = df.withColumn('NEW_COLUMN_VALUE',
                   when(col('Id') == 1, col("A_2") + col("A_3") + col("A_4"))
                   .otherwise(lit(None)))

Another column should be the sum of A_18 to A_40. Is there an easy way to avoid writing it out as below (adding more than 20 columns by hand)? The columns follow a pattern A_1, A_2, ... up to A_80, and there are other id fields as well.

col("A_18")+col("A_19")+col("A_20").......

Upvotes: 0

Views: 87

Answers (2)

abiratsis

Reputation: 7336

Here is another solution. All the metadata is stored in a Python dictionary, cols, which acts as the configuration of the application. Each item of the dictionary describes one new column: the key is the name of the new column, and the value is a list holding the metadata for that column.

More specifically, this list consists of:

  1. The id value used in the condition, i.e. df["id"] == lit(v[0])
  2. The starting element of the range
  3. The last element of the range

In order to create the columns that we need to add, we first generate the column names from that range.

Next we use expr to add the generated columns together, and finally we attach each new column to the dataframe with withColumn.

Here is the complete code:

from pyspark.sql.functions import expr, lit, when

col_prefix = "a"  # change this to match your column prefix, e.g. "A_"
df_cols = ['id', 'a1', 'a2', 'a3', 'a4']
data = [(1, 1, 2, 4, 7),
        (2, 2, 4, 5, 8)]

df = spark.createDataFrame(data, df_cols)

cols = {
  "col1": [1, 1, 3],
  "col2": [2, 2, 4]
}

for k, v in cols.items():
  target_cols = [f"{col_prefix}{idx}" for idx in range(v[1], v[2] + 1)]  # e.g. ["a1", "a2", "a3"]
  add_expr = " + ".join(target_cols)  # e.g. "a1 + a2 + a3"
  df = df.withColumn(k, when(df["id"] == lit(v[0]), expr(add_expr)))

df.show()

# +---+---+---+---+---+----+----+
# | id| a1| a2| a3| a4|col1|col2|
# +---+---+---+---+---+----+----+
# |  1|  1|  2|  4|  7|   7|null|
# |  2|  2|  4|  5|  8|null|  17|
# +---+---+---+---+---+----+----+
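To map this onto the naming in your question (prefix A_, condition on Id), the configuration would look roughly like the sketch below; the second column name and its id value are placeholders, since the question does not give them, and df["id"] in the loop should point at your actual Id column:

col_prefix = "A_"
cols = {
    "NEW_COLUMN_VALUE": [1, 2, 4],     # when Id == 1, sum A_2 + A_3 + A_4
    "OTHER_COLUMN_VALUE": [1, 18, 40]  # placeholder name: sum A_18 .. A_40
}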

Upvotes: 1

Kafels

Reputation: 4069

A few lines of Python solve your problem easily:

from functools import reduce
from operator import add

import pyspark.sql.functions as f


def filter_columns(dataframe, start, stop):
    """Yield every A_<n> column whose index n falls in [start, stop]."""
    for column in dataframe.columns:
        if column.startswith('A_'):
            number = int(column.split('_')[-1])
            if start <= number <= stop:
                yield f.col(column)


# From A_1 to A_4
new_df = df.withColumn('foo', reduce(add, filter_columns(df, start=1, stop=4)))

# From A_18 to A_40
new_df = new_df.withColumn('bar', reduce(add, filter_columns(df, start=18, stop=40)))
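
As a quick sanity check, assuming a small sample dataframe with columns Id and A_1..A_4 (hypothetical data, not taken from the question), the helper can be exercised like this:

sample = spark.createDataFrame(
    [(1, 1, 2, 4, 7), (2, 2, 4, 5, 8)],
    ['Id', 'A_1', 'A_2', 'A_3', 'A_4'])

# Sum A_2 through A_4 for every row
sample.withColumn('NEW_COLUMN_VALUE',
                  reduce(add, filter_columns(sample, start=2, stop=4))).show()

# +---+---+---+---+---+----------------+
# | Id|A_1|A_2|A_3|A_4|NEW_COLUMN_VALUE|
# +---+---+---+---+---+----------------+
# |  1|  1|  2|  4|  7|              13|
# |  2|  2|  4|  5|  8|              17|
# +---+---+---+---+---+----------------+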

Upvotes: 1
