Reputation: 6566
Say I have a Spark SQL DataFrame like so:
name gender grade
-----------------
Joe M 3
Sue F 2
Pam F 3
Gil M 2
Lon F 3
Kim F 3
Zoe F 2
I want to create a report of single values like so:
numMales numFemales numGrade2 numGrade3
---------------------------------------
2 5 3 4
What is the best way to do this? I know how to get one of these individually like so:
val numMales = dataDF.where($"gender" == "M").count
But I don't really know how to put this into a DataFrame, or how to combine all the results.
Upvotes: 0
Views: 551
Reputation: 35219
You can explode and pivot:
import org.apache.spark.sql.functions._
val cols = Seq("gender", "grade")
df
.select(explode(array(cols map (c => concat(lit(c), col(c))): _*)))
.groupBy().pivot("col").count.show
// +-------+-------+------+------+
// |genderF|genderM|grade2|grade3|
// +-------+-------+------+------+
// | 5| 2| 3| 4|
// +-------+-------+------+------+
Upvotes: 1
Reputation: 41957
Use of when
, sum
and struct
inbuilt functions should give you your desired result
import org.apache.spark.sql.functions._
dataDF.select(struct(sum(when(col("gender")==="M", 1)).as("numMales"), sum(when(col("gender")==="F", 1)).as("numFemales")).as("genderCounts"),
struct(sum(when(col("grade")===2, 1)).as("numGrade2"), sum(when(col("grade")===3, 1)).as("numGrade3")).as("gradeCounts"))
.select(col("genderCounts.*"), col("gradeCounts.*"))
.show(false)
which should give you
+--------+----------+---------+---------+
|numMales|numFemales|numGrade2|numGrade3|
+--------+----------+---------+---------+
|2 |5 |3 |4 |
+--------+----------+---------+---------+
Upvotes: 2
Reputation: 40884
I'd say that you need to .groupBy().count()
your dataframe separately by each column, them combine the answers into a new dataframe.
Upvotes: 0