Joel

Reputation: 1690

Spark Categorical Data Summary Statistics

For continuous data, one can use RDD.map(x => x.scores(0)).stats() to calculate the summary statistics, which gives a result like:

org.apache.spark.util.StatCounter = (count: 4498289, mean: 0.028091, stdev: 2.332627, max: 22.713133, min: -36.627933)

How can I get a similar summary for categorical data in Spark (count of distinct values, individual counts of the top values, etc.)?

Upvotes: 0

Views: 1299

Answers (1)

Joel

Reputation: 1690

After further research, I found out how to get a histogram of categorical data, in case anyone else is interested:

val countColumn = parsedLines.map(_.ColumnName).countByValue()
countColumn.toSeq.sortBy(_._2).reverse.foreach(println)

This prints each distinct value of the column together with its count, sorted from most to least frequent.
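For the other figures asked about in the question (total count, number of distinct values, counts of the top values), the same countByValue() call can be wrapped into a small helper. The following is only a rough sketch: CategoricalSummary, CategoricalStats, summarize and the sample colour data are names made up for illustration, not part of Spark, and it assumes the column has few enough distinct values for the counts to fit on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical container for the summary; not part of Spark's API.
case class CategoricalSummary(
  total: Long,              // number of records
  distinct: Int,            // number of distinct values
  top: Seq[(String, Long)]  // most frequent values with their counts
)

object CategoricalStats {
  // countByValue() returns one (value, count) pair per distinct value
  // to the driver, so this assumes a manageable number of categories.
  def summarize(values: RDD[String], topN: Int): CategoricalSummary = {
    val counts = values.countByValue()
    // Sort by descending count and keep the topN most frequent values.
    val top = counts.toSeq.sortBy { case (_, n) => -n }.take(topN)
    CategoricalSummary(counts.values.sum, counts.size, top)
  }

  def main(args: Array[String]): Unit = {
    // Local master and sample data are assumptions for the demo only.
    val conf = new SparkConf().setAppName("categorical-stats").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val colours = sc.parallelize(Seq("red", "blue", "red", "green", "red", "blue"))
      val summary = summarize(colours, topN = 2)
      println(s"count: ${summary.total}, distinct: ${summary.distinct}")
      summary.top.foreach { case (value, n) => println(s"$value: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

With the RDD from the answer, the call would be something like CategoricalStats.summarize(parsedLines.map(_.ColumnName), 10), assuming ColumnName is a String field. For columns with a very large number of distinct values, aggregating with map(v => (v, 1L)).reduceByKey(_ + _) and then taking the top entries keeps the counting on the cluster rather than on the driver.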

Upvotes: 1
