Abir Chokraborty

Reputation: 1765

How to count the number of occurrences of each distinct element in a column of a spark dataframe

Suppose I have a dataframe in the following format:

-------------------------------
   col1    |  col2    | col3
-------------------------------
value11    | value21  | value31
value12    | value22  | value32
value11    | value22  | value33
value12    | value21  | value33

Here, column col1 has two distinct values, value11 and value12. I want the total number of occurrences of each distinct value (value11, value12) in column col1.

Upvotes: 1

Views: 6158

Answers (1)

akuiper

Reputation: 214927

You can groupBy col1, then count:

import org.apache.spark.sql.functions.count

df.groupBy("col1").agg(count("col1")).show
+-------+-----------+
|   col1|count(col1)|
+-------+-----------+
|value12|          2|
|value11|          2|
+-------+-----------+
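The same grouping-and-counting semantics can be illustrated without a Spark cluster using plain Scala collections; this is only an analogy (the `col1` sample data below is made up to mirror the dataframe in the question), not Spark code:

```scala
// Hypothetical sample data mirroring col1 of the example dataframe
val col1 = Seq("value11", "value12", "value11", "value12")

// groupBy + size on a plain collection mirrors
// df.groupBy("col1").agg(count("col1")) in Spark
val counts: Map[String, Int] = col1.groupBy(identity).view.mapValues(_.size).toMap

// each distinct value occurs twice in this sample
println(counts)
```

As an aside, Spark also offers the shorthand `df.groupBy("col1").count()`, which produces an equivalent result with the count column named `count`.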

In case you want to know how many distinct values there are in col1, you can use countDistinct:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("col1").as("n_distinct")).show
+----------+
|n_distinct|
+----------+
|         2|
+----------+
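Again as a plain-Scala analogy (not Spark code, and using the same made-up sample data as above), `countDistinct` corresponds to taking the distinct values and counting them:

```scala
// Hypothetical sample data mirroring col1 of the example dataframe
val col1 = Seq("value11", "value12", "value11", "value12")

// distinct + size on a plain collection mirrors
// df.agg(countDistinct("col1")) in Spark
val nDistinct: Int = col1.distinct.size

println(nDistinct) // 2
```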

Upvotes: 2
