Count unique column values given another column in PySpark

Question

I am trying to count Date for each unique ID in Pyspark.

+-------------------+----------+
|               Date|        ID|
+-------------------+----------+
|2022-03-19 00:00:00|   Ax3838J|
|2022-03-11 00:00:00|   Ax3838J|
|2021-11-01 00:00:00|   Ax3838J|
|2021-10-27 00:00:00|   Ax3838J|
|2021-10-25 00:00:00|   Bz3838J|
|2021-10-22 00:00:00|   Bz3838J|
|2021-10-18 00:00:00|   Bz3838J|
|2021-10-15 00:00:00|   Rr7422u|
|2021-09-22 00:00:00|   Rr742uL|
+-------------------+----------+

When I tried

df.groupBy('ID').count('Date').show()

I got the error: _api() takes 1 positional argument but 2 were given which makes sense, but I am not sure what are the other techniques exits to count so in PySpark.

How do I count unique Date values with this:

df.groupBy('ID').count().show()

Expected output:

+-------------------+----------+
|               Date|        ID|
+-------------------+----------+
|                  4|   Ax3838J|
|                  3|   Bz3838J|
|                  2|   Rr742uL|
+-------------------+----------+

Mahesh Gupta · Accepted Answer

Please find the working version of expected output. I am running code on spark-3.

from pyspark.sql.functions import countDistinct

data = [["2022-03-19 00:00:00", "Ax3838J"], ["2022-03-11 00:00:00", "Ax3838J"], ["2021-11-01 00:00:00", "Ax3838J"], ["2021-10-27 00:00:00", "Ax3838J"], ["2021-10-25 00:00:00", "Bz3838J"], ["2021-10-22 00:00:00", "Bz3838J"], ["2021-10-18 00:00:00", "Bz3838J"], ["2021-10-15 00:00:00", "Rr7422u"], ["2021-09-22 00:00:00", "Rr742uL"]]
df = spark.createDataFrame(data, ['Date', 'ID'])
df.show()
+-------------------+-------+
|               Date|     ID|
+-------------------+-------+
|2022-03-19 00:00:00|Ax3838J|
|2022-03-11 00:00:00|Ax3838J|
|2021-11-01 00:00:00|Ax3838J|
|2021-10-27 00:00:00|Ax3838J|
|2021-10-25 00:00:00|Bz3838J|
|2021-10-22 00:00:00|Bz3838J|
|2021-10-18 00:00:00|Bz3838J|
|2021-10-15 00:00:00|Rr742uL|
|2021-09-22 00:00:00|Rr742uL|
+-------------------+-------+

df.groupby("ID").agg(countDistinct("Date").alias("count")).show()
+-------+-----+
|     ID|count|
+-------+-----+
|Rr742uL|    2|
|Ax3838J|    4|
|Bz3838J|    3|
+-------+-----+

Please let me know if you need any help and if its solve your purpose please accept it

Count unique column values given another column in PySpark

Answers (2)

Related Questions