K. K.

Reputation: 572

Calculating percentage of multiple column values of a Spark DataFrame in PySpark

I have multiple binary columns (0 and 1) in my Spark DataFrame. I want to calculate the percentage of 1s in each column and project the result into another DataFrame.

The input DataFrame dF looks like:

+------------+-----------+
|           a|          b|
+------------+-----------+
|           0|          1|
|           1|          1|
|           0|          0|
|           1|          1|
|           0|          1|
+------------+-----------+

Expected output would look like:

+------------+-----------+
|           a|          b|
+------------+-----------+
|          40|         80|
+------------+-----------+

40 (2/5) and 80 (4/5) are the percentages of 1s in columns a and b, respectively.

What I have tried so far is creating a custom aggregation function that takes a column and the DataFrame, groups by that column to get the counts of 0s and 1s, calculates the percentage of each, and finally filters the result to keep only the row for 1.

selection =  ['a', 'b']

@F.udf
def cal_perc(c, dF):
    grouped = dF.groupBy(c).count()
    grouped = grouped.withColumn('perc_' + str(c), ((grouped['count']/5) * 100))
    return grouped[grouped[c] == 1].select(['perc_' + str(c)])

dF.select(*(dF[c].alias(c) for c in selection)).agg(*(cal_perc(c, dF).alias(c) for c in selection)).show()

This does not seem to be working. I'm not able to figure out where I'm going wrong. Any help appreciated. Thanks.

Upvotes: 0

Views: 875

Answers (1)

Georg Heiler

Reputation: 17676

If your columns in fact only ever contain 0 and 1 and no other values, the mean is equivalent (just multiply it by 100). It is implemented natively in Spark.
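For example, a minimal sketch using the built-in avg aggregation, assuming the DataFrame and column names dF, a, and b from the question:

from pyspark.sql import functions as F

selection = ['a', 'b']

# For a 0/1 column, the mean equals the fraction of 1s;
# multiplying by 100 gives the percentage.
dF.select([(F.avg(c) * 100).alias(c) for c in selection]).show()

This runs as a single native aggregation, so no UDF or per-column groupBy is needed.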

Upvotes: 4
