Reputation: 93
I'm trying to write a SQL query:
select indicator, count(distinct tid) as tidcount
from coa
group by indicator
Below is the normal output:

indicator  tidcount
M          6219
Z          411424
S          1
I          1
I need a row-wise percentage for each tidcount. The query I'm trying is below:

spark.sql("""
    select
        indicator,
        count(tid) as tidcount,
        round(round(count(indicator) / sum(count(indicator)) over (), 4) * 100, 4) as PERCENTAGE_TOTALS
    from coa
    group by indicator
""")
indicator  tidcount  Percentage_total
M          6219      0.72
Z          411424    98.78
S          1         .49
I          1         .02
The expected output is:

indicator  tidcount  Percentage_total
M          6219      1.4
Z          411424    98.5
S          1         .0002
I          1         .0002
Please suggest what I'm missing; a solution in either Spark SQL or PySpark is fine.
Upvotes: 2
Views: 126
Reputation: 71689
spark.sql

The percentage has to be computed from the same COUNT(DISTINCT tid) aggregate as tidcount (your query mixes count(tid) with count(indicator)):
spark.sql(
    """
    SELECT
        indicator,
        COUNT(DISTINCT tid) AS tidcount,
        COUNT(DISTINCT tid) / SUM(COUNT(DISTINCT tid)) OVER () * 100 AS PCT
    FROM coa
    GROUP BY indicator
    """
)
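To see why the distinct counts give the expected numbers, here is a plain-Python sanity check of the window-sum arithmetic using the distinct counts from the question (a hypothetical recreation, no Spark session required):

```python
# Distinct tid counts per indicator, taken from the question's first query.
counts = {"M": 6219, "Z": 411424, "S": 1, "I": 1}

total = sum(counts.values())  # 417645 distinct tids overall
pct = {k: round(v / total * 100, 4) for k, v in counts.items()}
print(pct)  # M ≈ 1.4891, Z ≈ 98.5105, S and I ≈ 0.0002
```

These agree with the expected output in the question up to truncation (1.4891 vs 1.4, 98.5105 vs 98.5), and the rounded percentages sum to 100.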
pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() makes the window span all rows, so the
# sum is the grand total of tidcount.
w = Window.partitionBy()
(
    df
    .groupby('indicator')
    .agg(F.count_distinct('tid').alias('tidcount'))
    .withColumn('PCT', F.col('tidcount') / F.sum('tidcount').over(w) * 100)
)
df.show()  # sample input
+---------+---+
|indicator|tid|
+---------+---+
| a| 10|
| a| 25|
| a| 7|
| b| 10|
| b| 10|
| c| 25|
| c| 7|
| d| 1|
| a| 2|
| a| 3|
+---------+---+
+---------+--------+-----------------+
|indicator|tidcount| PCT|
+---------+--------+-----------------+
| d| 1|11.11111111111111|
| c| 2|22.22222222222222|
| b| 1|11.11111111111111|
| a| 5|55.55555555555556|
+---------+--------+-----------------+
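For reference, a plain-Python recreation of the distinct-count-over-total logic on the same sample rows (hypothetical, no Spark session needed) reproduces these tidcount and PCT values:

```python
from collections import defaultdict

# Sample (indicator, tid) rows from df.show() above.
rows = [("a", 10), ("a", 25), ("a", 7), ("b", 10), ("b", 10),
        ("c", 25), ("c", 7), ("d", 1), ("a", 2), ("a", 3)]

# Collect the distinct tids per indicator.
distinct = defaultdict(set)
for indicator, tid in rows:
    distinct[indicator].add(tid)

tidcount = {k: len(v) for k, v in distinct.items()}  # distinct tids per group
total = sum(tidcount.values())                       # 9
pct = {k: c / total * 100 for k, c in tidcount.items()}
print(tidcount)  # {'a': 5, 'b': 1, 'c': 2, 'd': 1}
```

Note that b counts as 1, not 2, because its two rows share tid 10 — the same reason the question needs COUNT(DISTINCT tid) in both the aggregate and the window sum.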
Upvotes: 1