Reputation: 57
In Databricks, when I run the approx_count_distinct function with the 'rsd' argument, it returns the error message below. It works fine without this argument.
Dataset
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
Code
from pyspark.sql.functions import approx_count_distinct, col

# Works without rsd; passing rsd as an integer raises the error below
df.agg(approx_count_distinct(col("salary")).alias("salaryDistinct"))
Error message
py4j.Py4JException: Method approx_count_distinct([class org.apache.spark.sql.Column, class java.lang.Integer]) does not exist
Upvotes: 0
Views: 1505
Reputation: 11489
I reproduced the above and got the same error.
The error occurs when the rsd value is given as an integer. As per pyspark.sql.functions.approx_count_distinct(), the rsd value should be a float.
The desired result is returned when a float is given, as in the sketch below.
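The following is a minimal sketch (assuming a SparkSession is available as spark and recreating the sample data from the question) that passes rsd as a float; the salaryDistinct alias is carried over from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct, col

spark = SparkSession.builder.getOrCreate()

data = [("James", "Sales", 3000), ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
        ("James", "Sales", 3000), ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100)]
df = spark.createDataFrame(data, ["employee_name", "department", "salary"])

# rsd (the maximum allowed relative standard deviation) must be a float such as 0.1;
# an int like 1 is sent as java.lang.Integer and raises the Py4JException above
df.agg(approx_count_distinct(col("salary"), rsd=0.1).alias("salaryDistinct")).show()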
Upvotes: 1