Reputation: 572
I have multiple binary columns (0 and 1) in my Spark DataFrame. I want to calculate the percentage of 1s in each column and project the result into another DataFrame.
The input DataFrame dF looks like:
+------------+-----------+
| a| b|
+------------+-----------+
| 0| 1|
| 1| 1|
| 0| 0|
| 1| 1|
| 0| 1|
+------------+-----------+
Expected output would look like:
+------------+-----------+
| a| b|
+------------+-----------+
| 40| 80|
+------------+-----------+
40 (2/5) and 80 (4/5) are the percentages of 1s in columns a and b, respectively.
What I have tried so far is creating a custom aggregation function, passing the two columns a and b to it, doing a group by to get the counts of 0s and 1s, calculating the percentages of 0s and 1s, and finally filtering the DataFrame to keep only the rows where the value is 1.
from pyspark.sql import functions as F

selection = ['a', 'b']

@F.udf
def cal_perc(c, dF):
    grouped = dF.groupBy(c).count()
    grouped = grouped.withColumn('perc_' + str(c), ((grouped['count'] / 5) * 100))
    return grouped[grouped[c] == 1].select(['perc_' + str(c)])

dF.select(*(dF[c].alias(c) for c in selection)).agg(*(cal_perc(c, dF).alias(c) for c in selection)).show()
This does not seem to be working, and I cannot figure out where I am going wrong. Any help appreciated. Thanks.
Upvotes: 0
Views: 875
Reputation: 17676
If your columns really only ever contain 0 and 1, the mean is equivalent to the fraction of 1s, and it is implemented natively in Spark.
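As a minimal sketch of that suggestion (assuming the same dF and column list as in the question, and that the result should be scaled to 0-100):

from pyspark.sql import functions as F

selection = ['a', 'b']

# avg() of a 0/1 column is the fraction of 1s; multiply by 100 to get a percentage
dF.select([(F.avg(F.col(c)) * 100).alias(c) for c in selection]).show()

With the sample data above this prints 40.0 and 80.0; cast the expressions to an integer type if whole numbers are required.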
Upvotes: 4