Reputation: 3845
I work in Spark 1.6 (unfortunately). I have a DataFrame with many columns whose values are only 0's and 1's, and I want to compute the percentage of 1's per column. So I do:
from pyspark.sql.functions import col, count, when

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
Is there a more efficient way to do this? Maybe a built-in function that sums per column (I did not find one, though)?
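For reference, a minimal self-contained version of what I am running (the sample data and column names below are made up, just to make the snippet runnable):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, count, when

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Toy 0/1 DataFrame, for illustration only
dfBinary = sqlContext.createDataFrame(
    [(1, 0), (1, 1), (0, 1), (1, 0)],
    ["a", "b"])

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
dfStat.show()  # a -> 0.75, b -> 0.5 for this toy data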
Upvotes: 0
Views: 464
Reputation: 35229
You can replace count and the division with mean, which avoids the extra data scan for the row count (the mean of a 0/1 indicator is exactly the fraction of 1's):
from pyspark.sql.functions import col, mean, when

dfStat = dfBinary.select([
    mean(when(col(c) == 1, 1).otherwise(0)).alias(c)
    for c in dfBinary.columns])
but otherwise, it is as efficient as you can get.
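And since the columns already contain only 0's and 1's (per the question), a sketch under that assumption: you can take the column mean directly, without the when indicator:

from pyspark.sql.functions import mean

# Assumes every column holds strictly 0/1 values; the mean is then the fraction of 1's.
dfStat = dfBinary.select([mean(c).alias(c) for c in dfBinary.columns])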
Upvotes: 1
Reputation: 5870
You can use sum() from the functions module:
from pyspark.sql.functions import sum

rowsNum = dfBinary.count()  # row count, as in the question
dfBinary.select([(sum(c) / rowsNum).alias(c) for c in dfBinary.columns]).show()
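Note that this still relies on the separate count() for rowsNum. If you want everything in a single aggregation, a sketch of one way to do it (count(lit(1)) simply counts rows inside the same job; the import alias just avoids shadowing Python's builtin sum):

from pyspark.sql.functions import count, lit, sum as sum_

# One pass: divide each column sum by the row count computed in the same aggregation
dfStat = dfBinary.select([(sum_(c) / count(lit(1))).alias(c)
                          for c in dfBinary.columns])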
Upvotes: 1