thecoder

Reputation: 237

count and distinct count without groupby using PySpark

I have a DataFrame (testdf) and would like to get the count and distinct count of a column (memid) where another column (booking/rental) is neither null nor empty (i.e. "").

testdf:

memid   booking  rental
100     Y
100
120     Y
100     Y        Y

Expected result: (for booking column not null/ not empty)

count(memid)  count(distinct memid)
      3                      2

If it was SQL:

Select count(memid), count(distinct memid) from mydf
where booking is not null and booking != ''

In PySpark:

mydf.filter("booking != ''").groupBy('booking').agg(count("memid"), countDistinct("memid"))

But I just want the overall counts, without grouping.

Upvotes: 0

Views: 4323

Answers (1)

deronwu

Reputation: 126

You can just remove the groupBy and call agg directly on the filtered DataFrame:

from pyspark.sql import functions as F

mydf = mydf.filter("booking != ''").agg(F.count("memid"), F.countDistinct("memid"))

Note that the filter also drops nulls: in Spark SQL, booking != '' evaluates to null when booking is null, and filter treats null as false.
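As a quick sanity check on the sample data from the question (plain Python, no Spark needed — the tuples below are my own encoding of the table, with None/"" standing in for missing values), the same filter-then-count logic gives the expected result:

```python
# Rows from testdf: (memid, booking, rental)
rows = [
    (100, "Y", None),
    (100, None, None),
    (120, "Y", None),
    (100, "Y", "Y"),
]

# Keep memid where booking is neither null nor empty,
# mirroring filter("booking != ''") in the Spark answer.
booked = [memid for memid, booking, _ in rows if booking not in (None, "")]

count_memid = len(booked)          # count(memid)
distinct_memid = len(set(booked))  # count(distinct memid)
print(count_memid, distinct_memid)  # 3 2
```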

Upvotes: 4

Related Questions