Reputation: 12518
I'm looking at Parquet files and would like to show the distinct values of a column together with the number of rows each value appears in.
The SQL equivalent is:
select last_name, count(*) from optimization.opt_res group by last_name
In Scala Spark, this only displays them separately:
val dataFrame = sparkSession.read.parquet(fname)
dataFrame.show(truncate = false)
// distinct values of the column
val disID = dataFrame.select("last_name").distinct()
disID.show(false)
// total number of distinct values, not the per-value row counts I want
val disCount = disID.count()
I want it to show:
+-----------+-------+
| last_name | count |
+-----------+-------+
| Alfred    |   202 |
| James     |  1020 |
+-----------+-------+
Upvotes: 0
Views: 590
Reputation: 1380
import org.apache.spark.sql.functions.count
import sparkSession.implicits._ // enables the $"..." column syntax

dataframe.groupBy($"last_name").agg(count("*"))
or
dataframe.groupBy($"last_name").count()
The concept is the same as SQL, but the syntax can be a bit tricky until you get used to it.
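To reproduce the table from the question end to end, here is a minimal sketch. The Parquet path fname and the column last_name come from the question; the orderBy at the end is only an assumption to make the output deterministic and is not required:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("distinct-counts").getOrCreate()
import spark.implicits._ // for the $"..." column syntax

val dataFrame = spark.read.parquet(fname) // fname as in the question

dataFrame
  .groupBy($"last_name")
  .agg(count("*").as("count")) // one row per distinct last_name with its row count
  .orderBy($"last_name")       // assumption: sorting purely for readable output
  .show(false)

This prints a table shaped like the expected output in the question, one row per distinct last_name.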
Upvotes: 1