Reputation: 12518
I'm looking at Parquet files and would like to show the distinct values of a column together with the number of rows each value appears in.
The SQL equivalent is:
select last_name, count(*) from optimization.opt_res group by last_name
In Scala Spark, this only displays them separately:
val dataFrame = sparkSession.read.parquet(fname)
dataFrame.show(truncate = false)
// distinct values of the column
val disID = dataFrame.select("last_name").distinct()
disID.show(false)
// total number of distinct values, not the per-value row counts I want
val disCount = disID.count()
I want it to show:
+-----------+-------+
| last_name | count |
+-----------+-------+
| Alfred    |   202 |
| James     |  1020 |
+-----------+-------+
Upvotes: 0
Views: 590
Reputation: 1380
import org.apache.spark.sql.functions.count
import sparkSession.implicits._ // enables the $"..." column syntax

dataframe.groupBy($"last_name").agg(count("*"))
or
dataframe.groupBy($"last_name").count()
The concept is the same as SQL, but the syntax can be a bit tricky until you get used to it.
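To reproduce the table from the question end to end, here is a minimal sketch. The Parquet path fname and the column last_name come from the question; the orderBy at the end is only an assumption to make the output deterministic and is not required:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("distinct-counts").getOrCreate()
import spark.implicits._ // for the $"..." column syntax

val dataFrame = spark.read.parquet(fname) // fname as in the question

dataFrame
  .groupBy($"last_name")
  .agg(count("*").as("count")) // one row per distinct last_name with its row count
  .orderBy($"last_name")       // assumption: sorting purely for readable output
  .show(false)

This prints a table shaped like the expected output in the question, one row per distinct last_name.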
Upvotes: 1