Reputation: 636
I have a spark dataframe like this
event_name | id
-----------|---
hello      | 1
hello      | 2
hello      | 1
world      | 1
hello      | 3
world      | 2
I want to count the occurrences of a specific event ("hello") based on unique id values. The SQL should look like this:
SELECT event_name, COUNT(DISTINCT id) AS count
FROM table_name
WHERE event_name = 'hello'
GROUP BY event_name
event_name | count
-----------|------
hello      | 3
So my query should return 3 rather than 4 for "hello", because the id "1" appears in two rows for that event.
How can I do that with PySpark SQL?
Upvotes: 2
Views: 2110
Reputation: 18013
This should do the trick:
from pyspark.sql import functions as F

# Count distinct ids per event_name
df.groupBy("event_name").agg(F.countDistinct("id").alias("count")).show()
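If you only need the count for "hello", you can filter before aggregating. A minimal sketch, assuming a SparkSession bound to spark and the DataFrame above bound to df:

from pyspark.sql import functions as F

# Keep only the "hello" events, then count distinct ids
df.filter(df.event_name == "hello") \
  .groupBy("event_name") \
  .agg(F.countDistinct("id").alias("count")) \
  .show()

# Equivalent via the SQL interface, using a temporary view
df.createOrReplaceTempView("table_name")
spark.sql("""
    SELECT event_name, COUNT(DISTINCT id) AS count
    FROM table_name
    WHERE event_name = 'hello'
    GROUP BY event_name
""").show()

Both return a single row, hello | 3, matching the expected output above.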
Upvotes: 4