Reputation: 3283
I am trying to evaluate, in PySpark, the sum of all elements of a dataframe. I wrote the following function:
def sum_all_elements(df):
    df = df.groupBy().sum()
    df = df.withColumn('total', sum(df[colname] for colname in df.columns))
    return df.select('total').collect()[0][0]
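For reference, a minimal way to exercise the function on a small made-up dataframe (the column names and values below are purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Toy dataframe: two rows, two numeric columns.
df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
print(sum_all_elements(df))  # 1 + 2 + 3 + 4 = 10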
To speed up the function, I tried converting the dataframe to an RDD and summing like this:
def sum_all_elements_pyspark(df):
    res = df.rdd.map(lambda x: sum(x)).sum()
    return res
But apparently the RDD version is slower than the dataframe one. Is there a way to speed up the RDD function?
Upvotes: 0
Views: 84
Reputation: 5526
Dataframe functions are faster than RDD ones because the Catalyst optimizer optimizes the operations performed over dataframes, but it does not do the same for RDDs.
When you execute an action over the dataframe API, Spark generates an optimized logical plan, converts that optimized logical plan into multiple physical plans, and then runs cost-based optimization to choose the best physical plan.
The final physical plan is RDD-equivalent code to execute, because at the low level RDDs are what actually run. So using a dataframe-API-based function will give you the performance boost you are after.
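As a rough sketch of what that means in practice (the toy dataframe here is made up for illustration, not taken from the question), you can keep the whole computation in a single dataframe aggregation so Catalyst sees all of it, and you can inspect the physical plan it produces with explain():

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])

# Python's built-in sum starts at 0 and folds Columns with `+`,
# producing one row-wise expression: a + b + ...
row_total = sum(F.col(c) for c in df.columns)

# Single aggregation over the whole dataframe, fully visible to Catalyst,
# instead of groupBy().sum() followed by withColumn.
query = df.select(F.sum(row_total).alias("total"))
print(query.collect()[0][0])  # 10

# Show the optimized physical plan Catalyst chose for this query.
query.explain()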
Upvotes: 2