Reputation: 1345
edf.select("x").distinct.show()
shows the distinct values that are present in the x column of the edf DataFrame.
Is there an efficient method to also show the number of times each of these distinct values occurs in the DataFrame (i.e., a count per distinct value)?
Upvotes: 40
Views: 150347
Reputation: 2103
If you are using Java, then the import
import org.apache.spark.sql.functions.countDistinct;
will give an error:
The import org.apache.spark.sql.functions.countDistinct cannot be resolved
This is because countDistinct is a static method of the functions class, and Java resolves static members only through a static import or a qualified call. To use countDistinct in Java, use the format below:
import org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;
// Qualify the static method through the functions class
df.agg(functions.countDistinct("some_column"));
Upvotes: 3
Reputation: 71
import org.apache.spark.sql.functions.countDistinct
// Counts the distinct values of "s" within each group of "a"
df.groupBy("a").agg(countDistinct("s")).collect()
Upvotes: 7
Reputation: 2622
Another option, without resorting to SQL functions:
df.groupBy('your_column_name').count().show()
show will print the distinct values and the number of times each occurs. Without show, the result is a DataFrame that can be processed further.
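For example, since the result is a DataFrame, it can be sorted before displaying; a sketch in Scala (the column name is a placeholder):
import org.apache.spark.sql.functions.desc
// Distinct values with their counts, most frequent first
df.groupBy("your_column_name").count().orderBy(desc("count")).show()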
Upvotes: 13
Reputation: 330063
countDistinct is probably the first choice:
import org.apache.spark.sql.functions.countDistinct
df.agg(countDistinct("some_column"))
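A minimal follow-up sketch (assuming a DataFrame df, e.g. in spark-shell) for reading the aggregated value back as a plain Long:
import org.apache.spark.sql.functions.countDistinct
// agg returns a single-row DataFrame; extract its one Long column
val distinctCount: Long = df.agg(countDistinct("some_column")).first().getLong(0)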
If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):
import org.apache.spark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct("some_column"))
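A sketch of the optional precision argument (assuming Spark 2.1+, where approx_count_distinct also accepts a maximum relative standard deviation, rsd):
import org.apache.spark.sql.functions.approx_count_distinct
// rsd: maximum estimation error allowed; smaller is more accurate but slower
df.agg(approx_count_distinct("some_column", 0.01))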
To get values and counts:
df.groupBy("some_column").count()
In SQL (spark-sql):
SELECT COUNT(DISTINCT some_column) FROM df
and
SELECT approx_count_distinct(some_column) FROM df
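For these queries to run, the DataFrame must first be registered under the table name df; a minimal sketch, assuming Spark 2.x with a SparkSession called spark:
// Register the DataFrame under the name used in the queries above
df.createOrReplaceTempView("df")
spark.sql("SELECT COUNT(DISTINCT some_column) FROM df").show()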
Upvotes: 80