Reputation: 636
I have a spark dataframe like this
event_name | id
-----------|---
hello      | 1
hello      | 2
hello      | 1
world      | 1
hello      | 3
world      | 2
I want to count the occurrences of a specific event ("hello") based on unique id values. The SQL should look like this:
SELECT event_name, COUNT(DISTINCT id) AS count
FROM table_name
WHERE event_name = 'hello'
GROUP BY event_name
event_name | count
-----------|------
hello      | 3
So my query should return 3 rather than 4 for "hello", because the id "1" appears in two rows for that event.
How can I do that with PySpark SQL?
Upvotes: 2
Views: 2110
Reputation: 18013
This should do the trick:
from pyspark.sql import functions as F

# Count distinct ids per event_name
df.groupBy("event_name").agg(F.countDistinct("id").alias("count")).show()
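If you only need the count for "hello", you can filter before aggregating. A minimal sketch, assuming a SparkSession bound to spark and the DataFrame above bound to df:

from pyspark.sql import functions as F

# Keep only the "hello" events, then count distinct ids
df.filter(df.event_name == "hello") \
  .groupBy("event_name") \
  .agg(F.countDistinct("id").alias("count")) \
  .show()

# Equivalent via the SQL interface, using a temporary view
df.createOrReplaceTempView("table_name")
spark.sql("""
    SELECT event_name, COUNT(DISTINCT id) AS count
    FROM table_name
    WHERE event_name = 'hello'
    GROUP BY event_name
""").show()

Both return a single row, hello | 3, matching the expected output above.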
Upvotes: 4