cornerstone347

Reputation: 57

approx_count_distinct pyspark agg function with rsd argument in Databricks

In Databricks, calling the approx_count_distinct function with the 'rsd' argument raises the error below. The same call works without that argument.

Dataset

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+

Code

from pyspark.sql.functions import approx_count_distinct, col

# alias() belongs on the aggregate column, not on the DataFrame returned by agg()
df.agg(approx_count_distinct(col("salary")).alias("salaryDistinct"))

Error message

py4j.Py4JException: Method approx_count_distinct([class org.apache.spark.sql.Column, class java.lang.Integer]) does not exist

Upvotes: 0

Views: 1505

Answers (1)

Rakesh Govindula

Reputation: 11489

I reproduced the above and got the same error.

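This error occurs when the rsd value is passed as an integer. Per the documentation for pyspark.sql.functions.approx_count_distinct(), rsd (the maximum relative standard deviation allowed, default 0.05) must be a float. A Python int reaches the JVM as java.lang.Integer, and no approx_count_distinct(Column, Integer) overload exists, which is exactly what the Py4JException reports.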

Passing rsd as a float gives the desired result.
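A minimal sketch of the float form, assuming the question's DataFrame `df` with a `salary` column. The helper names `as_rsd` and `distinct_salaries` are hypothetical and only for illustration, and the positivity check is my own sanity check rather than anything PySpark documents. The only PySpark calls used are `approx_count_distinct`, `col`, `agg` and `alias`.

```python
def as_rsd(value):
    """Coerce a user-supplied rsd to float (hypothetical helper).

    A Python int would be sent to the JVM as java.lang.Integer, for which
    there is no matching approx_count_distinct overload.
    """
    rsd = float(value)
    # Assumption: a non-positive relative error is never meaningful.
    if rsd <= 0:
        raise ValueError(f"rsd must be positive, got {rsd}")
    return rsd


def distinct_salaries(df, rsd=0.05):
    """Approximate distinct count of `salary` with an explicit float rsd."""
    # Imported lazily so as_rsd can be used where pyspark is not installed.
    from pyspark.sql.functions import approx_count_distinct, col

    return df.agg(
        approx_count_distinct(col("salary"), rsd=as_rsd(rsd)).alias("salaryDistinct")
    )
```

With the question's data this would be called as `distinct_salaries(df, rsd=0.1).show()`. The direct equivalent without the helpers is `df.agg(approx_count_distinct(col("salary"), 0.1).alias("salaryDistinct"))`, where the second argument is written as `0.1` rather than an integer literal.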

Upvotes: 1
