Suppose I have a DataFrame in the following format:
-------------------------------
col1    | col2    | col3
-------------------------------
value11 | value21 | value31
value12 | value22 | value32
value11 | value22 | value33
value12 | value21 | value33
Here, column col1 has two distinct values, value11 and value12. I want the total number of occurrences of each of those distinct values in col1.
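For reference, a minimal sketch that builds this example DataFrame, assuming a SparkSession named spark is already in scope:
// Hypothetical setup for the sample data above
import spark.implicits._
val df = Seq(
  ("value11", "value21", "value31"),
  ("value12", "value22", "value32"),
  ("value11", "value22", "value33"),
  ("value12", "value21", "value33")
).toDF("col1", "col2", "col3")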
You can groupBy col1, then count:
import org.apache.spark.sql.functions.count
df.groupBy("col1").agg(count("col1")).show
+-------+-----------+
| col1|count(col1)|
+-------+-----------+
|value12| 2|
|value11| 2|
+-------+-----------+
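Equivalently, the grouped result exposes a count() shortcut that yields a column simply named count; on the sample data it should produce the same figures:
df.groupBy("col1").count().show
+-------+-----+
|   col1|count|
+-------+-----+
|value12|    2|
|value11|    2|
+-------+-----+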
If you want to know how many distinct values there are in col1, you can use countDistinct:
import org.apache.spark.sql.functions.countDistinct
df.agg(countDistinct("col1").as("n_distinct")).show
+----------+
|n_distinct|
+----------+
| 2|
+----------+
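On large data where an exact answer is not required, a HyperLogLog-based estimate is cheaper; a sketch using approx_count_distinct (assuming Spark 2.1+, where this function is available):
import org.apache.spark.sql.functions.approx_count_distinct
// Approximate distinct count of col1 (an estimate, not an exact count)
df.agg(approx_count_distinct("col1").as("n_distinct_approx")).show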