Michail N

Reputation: 3845

Take the sum of a pyspark dataframe per column efficiently

I work in Spark 1.6 (unfortunately). I have a dataframe with many columns whose values are all 0's and 1's, and I want to compute the percentage of 1's per column. So I do:

from pyspark.sql.functions import count, when, col

rowsNum = dfBinary.count()
# count() skips nulls, so when(col(c) == 1, c) keeps only the rows equal to 1
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])

Is there a more efficient way to do this? Maybe there is a built-in function that sums per column (I did not find any, though).
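For concreteness, a minimal end-to-end run of the approach above on a hypothetical toy dataframe (the data, column names, and the existing SparkContext sc are illustrative assumptions, not from the question):

from pyspark.sql import SQLContext
from pyspark.sql.functions import count, when, col

sqlContext = SQLContext(sc)  # Spark 1.6 entry point; sc is an existing SparkContext
dfBinary = sqlContext.createDataFrame(
    [(1, 0), (1, 1), (0, 1), (1, 0)], ["a", "b"])

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
# expected fractions: a -> 0.75 (three 1's out of four rows), b -> 0.5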

Upvotes: 0

Views: 464

Answers (2)

Alper t. Turker

Reputation: 35229

You can replace the count and the division with mean to avoid the additional data scan:

from pyspark.sql.functions import mean, when, col

# the mean of a 0/1 indicator is exactly the fraction of 1's
dfStat = dfBinary.select([
    mean(when(col(c) == 1, 1).otherwise(0)).alias(c)
    for c in dfBinary.columns])

but otherwise, it is as efficient as you can get.
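Using the same hypothetical toy dfBinary sketched under the question, this version returns the same fractions without the separate count() job:

dfStat.show()
# expected output (assuming the toy data above):
# +----+---+
# |   a|  b|
# +----+---+
# |0.75|0.5|
# +----+---+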

Upvotes: 1

Suresh

Reputation: 5870

You can use sum() from the functions module:

from pyspark.sql.functions import sum

rowsNum = dfBinary.count()  # total row count, as in the question
dfBinary.select([(sum(c) / rowsNum).alias(c) for c in dfBinary.columns]).show()
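Note that this import shadows Python's built-in sum for the rest of the module; a common workaround (purely a naming choice, not required by the answer) is to alias it:

from pyspark.sql.functions import sum as sum_

rowsNum = dfBinary.count()
dfBinary.select([(sum_(c) / rowsNum).alias(c) for c in dfBinary.columns]).show()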

Upvotes: 1
