Reputation: 581
I need to use F.countDistinct() in a Spark SQL query that uses grouping sets, which are only available through the Spark SQL API, but F.countDistinct() is only exposed in the Python/Scala API. I've tried to register it as a UDF, but to no avail.
Here's what I tried:
from pyspark.sql import functions as F

df = spark.createDataFrame([
[2020, "England"],
[2019, "Wales"],
[2018, "Ireland"],
[2020, "England"],
[2016, "England"],
[2015, "Ireland"],
[2006, "France"],
[2005, "Wales"],
[2004, "France"],
[2000, "England"],
[2002, "France"],
[2000, "England"],
[2020, "England"],
],
["year", "nation"]
)
spark.sqlContext.udf.register("countDistinct", F.countDistinct)
df.selectExpr("countDistinct(nation) as c_distinct")
Then I got a very long error message containing: AttributeError: 'NoneType' object has no attribute '_jvm'
Upvotes: 0
Views: 339
Reputation: 42352
You can't register a Spark SQL built-in function as a UDF - that makes no sense. F.countDistinct isn't a Python function; it's just the DataFrame API's spelling of count(DISTINCT ...), which already exists in SQL. You just need the correct syntax. In this case, it would be
df.selectExpr('count(distinct nation) as c_distinct').show()
+----------+
|c_distinct|
+----------+
| 4|
+----------+
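Since the question is about grouping sets, the same count(DISTINCT ...) syntax works there too. A sketch in Spark SQL, assuming the DataFrame has first been registered as a temp view (the view name "results" is my own placeholder, not from the post - e.g. df.createOrReplaceTempView("results"), then run the query with spark.sql(...)):

```sql
-- Distinct nation count per year, plus an overall total row
-- (the row where year is NULL comes from the empty grouping set).
SELECT year, count(DISTINCT nation) AS c_distinct
FROM results
GROUP BY year GROUPING SETS ((year), ())
```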
Upvotes: 1