Giordan Pretelin

Reputation: 363

Calculating percentage of total count for groupBy using pyspark

I have the following code in pyspark, resulting in a table showing me the different values for a column and their counts. I want to have another column showing what percentage of the total count does each row represent. How do I do that?

from pyspark.sql.functions import desc

difrgns = (
    df1
    .groupBy("column_name")
    .count()
    .sort(desc("count"))
)
difrgns.show()

Upvotes: 16

Views: 38831

Answers (4)

dMb

Reputation: 9337

When df itself is a more complex transformation chain and running it twice -- first to compute the total count and then to group and compute percentages -- is too expensive, it's possible to leverage a window function to achieve similar results. Here's more generalized code (extending bluephantom's answer) that can be used with any number of group-by dimensions:

from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

data = [
    ("A", "X", 2, 100),
    ("A", "X", 7, 100),
    ("B", "X", 10, 100),
    ("C", "X", 1, 100),
    ("D", "X", 50, 100),
    ("E", "X", 30, 100),
]

# sc and sqlContext are assumed to be available (e.g. in a Databricks notebook)
rdd = sc.parallelize(data)

someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))

df = (
    sqlContext
    .createDataFrame(someschema)
    .withColumn(
        "total_count",
        F.count("*").over(Window.partitionBy(<your N-1 dimensions here>))
    )
    .groupBy(<your N dimensions here>)
    .agg(
        (F.count("*") / F.first(F.col("total_count"))).alias("percent_total")
    )
)

df.show()
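
Since the placeholders above aren't runnable as-is, here is a minimal sketch of the same single-pass idea with hypothetical dimensions filled in (c2 as the one "parent" dimension for the window, c1 and c2 as the group-by dimensions) and a SparkSession instead of the sc/sqlContext globals:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
     ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)],
    ["c1", "c2", "val1", "val2"],
)

result = (
    df
    # row count within each c2 partition, computed without a second pass over df
    .withColumn("total_count", F.count("*").over(Window.partitionBy("c2")))
    .groupBy("c1", "c2")
    .agg((F.count("*") / F.first("total_count")).alias("percent_total"))
)

result.show()

Here every c2 value is "X", so percent_total comes out as each group's share of all six rows.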

Upvotes: 7

s510

Reputation: 2822

More "beatified" output, eliminating the excess decimals and sort it.

import pyspark.sql.functions as F

count_cl = data_fr.count()

(
    data_fr
    .groupBy("col_name")
    .count()
    .withColumn("%", F.round((F.col("count") / count_cl) * 100, 2))
    .orderBy("count", ascending=False)
    .show(4, False)
)
+--------+-----+----+
|col_name|count|   %|
+--------+-----+----+
|C.LQQQQ |30957|8.91|
|C.LQQQQ |29688|8.54|
|C-LQQQQ |29625|8.52|
|CLQQQQ  |29342|8.44|
+--------+-----+----+

Upvotes: 5

Ged

Reputation: 18013

An example as an alternative if you're not comfortable with windowing, which the comments allude to and which is the better way to go:

# Running in Databricks, not all stuff required
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

data = [
    ("A", "X", 2, 100),
    ("A", "X", 7, 100),
    ("B", "X", 10, 100),
    ("C", "X", 1, 100),
    ("D", "X", 50, 100),
    ("E", "X", 30, 100),
]

rdd = sc.parallelize(data)

rows = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))

df = sqlContext.createDataFrame(rows)

tot = df.count()

(
    df
    .groupBy("c1")
    .count()
    .withColumnRenamed('count', 'cnt_per_group')
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 )
    .show()
)

Which returns:

+---+-------------+-------------------+
| c1|cnt_per_group|perc_of_count_total|
+---+-------------+-------------------+
|  E|            1| 16.666666666666664|
|  B|            1| 16.666666666666664|
|  D|            1| 16.666666666666664|
|  C|            1| 16.666666666666664|
|  A|            2|  33.33333333333333|
+---+-------------+-------------------+

I focus on Scala, and it seems easier there. That said, the solution suggested via the comments uses Window, which is what I would do in Scala with over().
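
For reference, a minimal PySpark sketch of that window-based approach (reusing the df built above, and assuming the grouped result is small enough that a global window is acceptable), which avoids the separate df.count() pass by summing the grouped counts over an empty partitionBy():

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# a window spanning the entire (already grouped) DataFrame
w = Window.partitionBy()

(
    df
    .groupBy("c1")
    .count()
    .withColumn("perc_of_count_total", F.col("count") / F.sum("count").over(w) * 100)
    .show()
)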

Upvotes: 23

Mykola Zotko

Reputation: 17804

You can groupby and aggregate with agg. For example, for the following DataFrame:

+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       b|    2|
|       a|    3|
+--------+-----+

You can use:

import pyspark.sql.functions as F

(
    df
    .groupby("category")
    .agg(
        F.count("value").alias("count"),
        (F.count("value") / df.count()).alias("percentage")
    )
    .show()
)

Output:

+--------+-----+------------------+
|category|count|        percentage|
+--------+-----+------------------+
|       b|    1|0.3333333333333333|
|       a|    2|0.6666666666666666|
+--------+-----+------------------+

Alternatively, you can use SQL:

df.createOrReplaceTempView("df")

(
    spark
    .sql(
        """
        SELECT category,
            COUNT(*) AS count,
            COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
        FROM df
        GROUP BY category
        """
    )
    .show()
)
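
If you'd rather avoid the scalar subquery, a window over the aggregated counts gives the same ratio in one statement (a sketch, assuming the same df temp view registered above):

(
    spark
    .sql(
        """
        SELECT category,
            COUNT(*) AS count,
            COUNT(*) / SUM(COUNT(*)) OVER () AS ratio
        FROM df
        GROUP BY category
        """
    )
    .show()
)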

Upvotes: 6
