sargupta

Reputation: 1043

Count unique column values given another column in PySpark

I am trying to count the distinct Date values for each unique ID in PySpark.

+-------------------+----------+
|               Date|        ID|
+-------------------+----------+
|2022-03-19 00:00:00|   Ax3838J|
|2022-03-11 00:00:00|   Ax3838J|
|2021-11-01 00:00:00|   Ax3838J|
|2021-10-27 00:00:00|   Ax3838J|
|2021-10-25 00:00:00|   Bz3838J|
|2021-10-22 00:00:00|   Bz3838J|
|2021-10-18 00:00:00|   Bz3838J|
|2021-10-15 00:00:00|   Rr742uL|
|2021-09-22 00:00:00|   Rr742uL|
+-------------------+----------+

When I tried

df.groupBy('ID').count('Date').show()

I got the error: _api() takes 1 positional argument but 2 were given. That makes sense, since count() on grouped data takes no column argument, but I am not sure what other techniques exist for this kind of count in PySpark.

How do I count unique Date values with this:

df.groupBy('ID').count().show()

Expected output:

+-------------------+----------+
|               Date|        ID|
+-------------------+----------+
|                  4|   Ax3838J|
|                  3|   Bz3838J|
|                  2|   Rr742uL|
+-------------------+----------+

Upvotes: 0

Views: 160

Answers (2)

Mahesh Gupta

Reputation: 1892

Here is a working version that produces the expected output. I am running this on Spark 3.

from pyspark.sql.functions import countDistinct

data = [["2022-03-19 00:00:00", "Ax3838J"], ["2022-03-11 00:00:00", "Ax3838J"], ["2021-11-01 00:00:00", "Ax3838J"], ["2021-10-27 00:00:00", "Ax3838J"], ["2021-10-25 00:00:00", "Bz3838J"], ["2021-10-22 00:00:00", "Bz3838J"], ["2021-10-18 00:00:00", "Bz3838J"], ["2021-10-15 00:00:00", "Rr742uL"], ["2021-09-22 00:00:00", "Rr742uL"]]
df = spark.createDataFrame(data, ['Date', 'ID'])
df.show()
+-------------------+-------+
|               Date|     ID|
+-------------------+-------+
|2022-03-19 00:00:00|Ax3838J|
|2022-03-11 00:00:00|Ax3838J|
|2021-11-01 00:00:00|Ax3838J|
|2021-10-27 00:00:00|Ax3838J|
|2021-10-25 00:00:00|Bz3838J|
|2021-10-22 00:00:00|Bz3838J|
|2021-10-18 00:00:00|Bz3838J|
|2021-10-15 00:00:00|Rr742uL|
|2021-09-22 00:00:00|Rr742uL|
+-------------------+-------+

df.groupby("ID").agg(countDistinct("Date").alias("count")).show()
+-------+-----+
|     ID|count|
+-------+-----+
|Rr742uL|    2|
|Ax3838J|    4|
|Bz3838J|    3|
+-------+-----+

Please let me know if you need any more help, and if this solves your problem, please accept the answer.

Upvotes: 1

Ashutosh sharma

Reputation: 76

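Try this (count(distinct ...) is SQL syntax, so in Python it has to be wrapped in expr):

from pyspark.sql.functions import expr

df.groupBy('ID').agg(expr('count(DISTINCT `Date`)').alias('count')).show()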

Upvotes: 0
