Reputation: 237
I have a dataframe (testdf) and would like to get the count and distinct count of a column (memid) where another column (booking/rental) is not null and not empty (i.e. "").
testdf:
memid  booking  rental
100    Y
100
120    Y
100    Y        Y
Expected result (for the booking column not null/not empty):
count(memid) count(distinct memid)
3 2
If it were SQL:
select count(memid), count(distinct memid) from mydf
where booking is not null and booking != ''
In PySpark:
mydf.filter("booking != ''").groupBy('booking').agg(count("memid"), countDistinct("memid"))
But I just want the overall counts, not counts grouped by booking.
Upvotes: 0
Views: 4323
Reputation: 126
You can just remove the groupBy and call agg directly, like this:
from pyspark.sql import functions as F
mydf = mydf.filter("booking != ''").agg(F.count("memid"), F.countDistinct("memid"))
Upvotes: 4