Michail N

Reputation: 3845

Take the sum of a pyspark dataframe per column efficiently

I work in Spark 1.6 (unfortunately). I have a dataframe with many columns whose values are all 0's and 1's, and I want to compute the percentage of 1's per column. So I do:

from pyspark.sql.functions import count, when, col

rowsNum = dfBinary.count()
# count() skips nulls, so when(col(c) == 1, c) keeps only the rows equal to 1
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])

Is there a more efficient way to do this? Maybe there is a built-in function that sums per column (I did not find any, though).
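For concreteness, a minimal end-to-end run of the approach above on a hypothetical toy dataframe (the data, column names, and the existing SparkContext sc are illustrative assumptions, not from the question):

from pyspark.sql import SQLContext
from pyspark.sql.functions import count, when, col

sqlContext = SQLContext(sc)  # Spark 1.6 entry point; sc is an existing SparkContext
dfBinary = sqlContext.createDataFrame(
    [(1, 0), (1, 1), (0, 1), (1, 0)], ["a", "b"])

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
# expected fractions: a -> 0.75 (three 1's out of four rows), b -> 0.5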

Upvotes: 0

Views: 464

Answers (2)

Alper t. Turker

Reputation: 35229

You can replace the count and the division with mean to avoid the additional data scan:

from pyspark.sql.functions import mean, when, col

# the mean of a 0/1 indicator is exactly the fraction of 1's
dfStat = dfBinary.select([
    mean(when(col(c) == 1, 1).otherwise(0)).alias(c)
    for c in dfBinary.columns])

but otherwise, it is as efficient as you can get.
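Using the same hypothetical toy dfBinary sketched under the question, this version returns the same fractions without the separate count() job:

dfStat.show()
# expected output (assuming the toy data above):
# +----+---+
# |   a|  b|
# +----+---+
# |0.75|0.5|
# +----+---+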

Upvotes: 1

Suresh

Reputation: 5870

You can use sum() from the functions module:

from pyspark.sql.functions import sum

rowsNum = dfBinary.count()  # total row count, as in the question
dfBinary.select([(sum(c) / rowsNum).alias(c) for c in dfBinary.columns]).show()
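Note that this import shadows Python's built-in sum for the rest of the module; a common workaround (purely a naming choice, not required by the answer) is to alias it:

from pyspark.sql.functions import sum as sum_

rowsNum = dfBinary.count()
dfBinary.select([(sum_(c) / rowsNum).alias(c) for c in dfBinary.columns]).show()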

Upvotes: 1
