Rafiul Sabbir

Reputation: 636

Count a column based on distinct values of another column in PySpark

I have a Spark DataFrame like this:

event_name | id
---------------
hello      | 1
hello      | 2
hello      | 1
world      | 1
hello      | 3
world      | 2

I want to count the occurrences of a specific event "hello" based on unique "id" values. In SQL it would look like this:

SELECT event_name, COUNT(DISTINCT id) AS count
FROM table_name
WHERE event_name = "hello"
GROUP BY event_name

which should return:

event_name | count
------------------
hello      | 3

So my query should return 3 instead of 4 for "hello", because "hello" has two rows with id "1".

How can I do that with pyspark SQL?

Upvotes: 2

Views: 2110

Answers (1)

Ged

Reputation: 18013

This should do the trick:

from pyspark.sql import functions as F

(df.filter(F.col("event_name") == "hello")
   .groupBy("event_name")
   .agg(F.countDistinct("id").alias("count"))
   .show())

The filter restricts the count to the "hello" event, and countDistinct deduplicates the ids before counting.
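The same distinct-count logic can be sketched in plain Python to verify the expected result on the sample data, without needing a Spark session (the variable names here are illustrative, not from the original post):

```python
# Sample data from the question: (event_name, id) pairs.
rows = [
    ("hello", 1), ("hello", 2), ("hello", 1),
    ("world", 1), ("hello", 3), ("world", 2),
]

# Collect the set of distinct ids seen for each event_name,
# mirroring groupBy("event_name") + countDistinct("id").
distinct_ids = {}
for event_name, row_id in rows:
    distinct_ids.setdefault(event_name, set()).add(row_id)

# COUNT(DISTINCT id) for event_name = "hello"
print(len(distinct_ids["hello"]))  # 3, since id 1 appears twice for "hello"
```

This confirms that duplicate (event_name, id) pairs are counted once, which is why the answer is 3 rather than 4.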

Upvotes: 4

Related Questions