Reputation: 581
I need to use F.countDistinct() in a Spark SQL query that uses grouping sets, which are only available through the Spark SQL API, but F.countDistinct() is only exposed in the Python/Scala API. I've tried to register it as a UDF, but to no avail.
Here's what I tried:
from pyspark.sql import functions as F

df = spark.createDataFrame([
[2020, "England"],
[2019, "Wales"],
[2018, "Ireland"],
[2020, "England"],
[2016, "England"],
[2015, "Ireland"],
[2006, "France"],
[2005, "Wales"],
[2004, "France"],
[2000, "England"],
[2002, "France"],
[2000, "England"],
[2020, "England"],
],
["year", "nation"]
)
spark.sqlContext.udf.register("countDistinct", F.countDistinct)
df.selectExpr("countDistinct(nation) as c_distinct")
Then I got a very long error message containing: AttributeError: 'NoneType' object has no attribute '_jvm'
Upvotes: 0
Views: 339
Reputation: 42352
You can't register a Spark SQL built-in function as a UDF - that makes no sense. F.countDistinct isn't a Python function; it's just the DataFrame API's spelling of count(DISTINCT ...), which already exists in SQL. You just need the correct syntax. In this case, it would be
df.selectExpr('count(distinct nation) as c_distinct').show()
+----------+
|c_distinct|
+----------+
| 4|
+----------+
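Since the question is about grouping sets, the same count(DISTINCT ...) syntax works there too. A sketch in Spark SQL, assuming the DataFrame has first been registered as a temp view (the view name "results" is my own placeholder, not from the post - e.g. df.createOrReplaceTempView("results"), then run the query with spark.sql(...)):

```sql
-- Distinct nation count per year, plus an overall total row
-- (the row where year is NULL comes from the empty grouping set).
SELECT year, count(DISTINCT nation) AS c_distinct
FROM results
GROUP BY year GROUPING SETS ((year), ())
```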
Upvotes: 1