Reputation: 93
I'm trying to write a SQL query:
select indicator, count(distinct tid) as tidcount
from coa
group by indicator
Below is the normal output:

indicator  tidcount
M          6219
Z          411424
S          1
I          1
I need a row-wise percentage for each tidcount. The query I'm trying is below:

spark.sql("""
    select
        indicator,
        count(tid) as tidcount,
        round(round(count(indicator) / sum(count(indicator)) over (), 4) * 100, 4) as PERCENTAGE_TOTALS
    from coa
    group by indicator
""")
indicator  tidcount  Percentage_total
M          6219      0.72
Z          411424    98.78
S          1         .49
I          1         .02
The expected output is:

indicator  tidcount  Percentage_total
M          6219      1.4
Z          411424    98.5
S          1         .0002
I          1         .0002
Please suggest what I'm missing; a solution in either Spark SQL or PySpark is fine.
Upvotes: 2
Views: 126
Reputation: 71689
spark.sql

The percentage has to be computed from the same COUNT(DISTINCT tid) aggregate as tidcount (your query mixes count(tid) with count(indicator)):
spark.sql(
    """
    SELECT
        indicator,
        COUNT(DISTINCT tid) AS tidcount,
        COUNT(DISTINCT tid) / SUM(COUNT(DISTINCT tid)) OVER () * 100 AS PCT
    FROM coa
    GROUP BY indicator
    """
)
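To see why the distinct counts give the expected numbers, here is a plain-Python sanity check of the window-sum arithmetic using the distinct counts from the question (a hypothetical recreation, no Spark session required):

```python
# Distinct tid counts per indicator, taken from the question's first query.
counts = {"M": 6219, "Z": 411424, "S": 1, "I": 1}

total = sum(counts.values())  # 417645 distinct tids overall
pct = {k: round(v / total * 100, 4) for k, v in counts.items()}
print(pct)  # M ≈ 1.4891, Z ≈ 98.5105, S and I ≈ 0.0002
```

These agree with the expected output in the question up to truncation (1.4891 vs 1.4, 98.5105 vs 98.5), and the rounded percentages sum to 100.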
pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() makes the window span all rows, so the
# sum is the grand total of tidcount.
w = Window.partitionBy()
(
    df
    .groupby('indicator')
    .agg(F.count_distinct('tid').alias('tidcount'))
    .withColumn('PCT', F.col('tidcount') / F.sum('tidcount').over(w) * 100)
)
df.show()  # sample input
+---------+---+
|indicator|tid|
+---------+---+
| a| 10|
| a| 25|
| a| 7|
| b| 10|
| b| 10|
| c| 25|
| c| 7|
| d| 1|
| a| 2|
| a| 3|
+---------+---+
+---------+--------+-----------------+
|indicator|tidcount| PCT|
+---------+--------+-----------------+
| d| 1|11.11111111111111|
| c| 2|22.22222222222222|
| b| 1|11.11111111111111|
| a| 5|55.55555555555556|
+---------+--------+-----------------+
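For reference, a plain-Python recreation of the distinct-count-over-total logic on the same sample rows (hypothetical, no Spark session needed) reproduces these tidcount and PCT values:

```python
from collections import defaultdict

# Sample (indicator, tid) rows from df.show() above.
rows = [("a", 10), ("a", 25), ("a", 7), ("b", 10), ("b", 10),
        ("c", 25), ("c", 7), ("d", 1), ("a", 2), ("a", 3)]

# Collect the distinct tids per indicator.
distinct = defaultdict(set)
for indicator, tid in rows:
    distinct[indicator].add(tid)

tidcount = {k: len(v) for k, v in distinct.items()}  # distinct tids per group
total = sum(tidcount.values())                       # 9
pct = {k: c / total * 100 for k, c in tidcount.items()}
print(tidcount)  # {'a': 5, 'b': 1, 'c': 2, 'd': 1}
```

Note that b counts as 1, not 2, because its two rows share tid 10 — the same reason the question needs COUNT(DISTINCT tid) in both the aggregate and the window sum.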
Upvotes: 1