Reputation: 1690
For continuous data, one can use RDD.map(x => x.scores(0)).stats()
to calculate summary statistics, which gives a result like:
org.apache.spark.util.StatCounter = (count: 4498289, mean: 0.028091, stdev: 2.332627, max: 22.713133, min: -36.627933)
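For reference, here is a minimal, self-contained sketch of what I mean by the continuous case (the Record type, field name, and sample data below are placeholders for my own parsing logic):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.StatCounter

// Placeholder record type holding one or more numeric scores.
case class Record(scores: Array[Double])

// In spark-shell the existing `sc` can be used instead.
val sc = new SparkContext(new SparkConf().setAppName("stats-example").setMaster("local[*]"))

// Placeholder data; in practice this comes from parsing the input files.
val records = sc.parallelize(Seq(Record(Array(0.5)), Record(Array(-1.2)), Record(Array(3.4))))

// stats() makes a single pass over the RDD and returns a StatCounter
// with count, mean, stdev, max and min.
val summary: StatCounter = records.map(_.scores(0)).stats()
println(summary)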
How can I achieve a similar result for categorical data in Spark (count of distinct values, individual counts of the top values, etc.)?
Upvotes: 0
Views: 1299
Reputation: 1690
After further research, I found out how to get histograms of categorical data.
In case anyone else is interested:
val countColumn = parsedLines.map(_.ColumnName).countByValue()
countColumn.toSeq.sortBy(_._2).reverse.foreach(println)
This will print each distinct value of the column and its count.
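In case it helps, here is a self-contained sketch of the same idea (the Record type, the ColumnName field, and the sample data are placeholders for your own parsing logic):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder record type with one categorical field.
case class Record(ColumnName: String)

// In spark-shell the existing `sc` can be used instead.
val sc = new SparkContext(new SparkConf().setAppName("categorical-histogram").setMaster("local[*]"))

// Placeholder data; replace with your own parsed lines.
val parsedLines = sc.parallelize(Seq(Record("a"), Record("b"), Record("a"), Record("c"), Record("a")))

// countByValue() returns a Map[String, Long] on the driver,
// mapping each distinct value to its count.
val countColumn = parsedLines.map(_.ColumnName).countByValue()

// Number of distinct values.
println(s"distinct values: ${countColumn.size}")

// Each distinct value and its count, most frequent first.
countColumn.toSeq.sortBy(_._2).reverse.foreach(println)

Note that countByValue() collects the whole map to the driver, so this is best suited to columns with a manageable number of distinct values.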
Upvotes: 1