user3685285
user3685285

Reputation: 6566

Spark DataFrame Summary

Say I have a Spark SQL DataFrame like so:

name gender grade
-----------------
 Joe      M     3
 Sue      F     2
 Pam      F     3
 Gil      M     2
 Lon      F     3
 Kim      F     3
 Zoe      F     2

I want to create a report of single values like so:

numMales numFemales numGrade2 numGrade3
---------------------------------------
       2          5         3         4

What is the best way to do this? I know how to get one of these individually like so:

val numMales = dataDF.where($"gender" == "M").count

But I don't really know how to put this into a DataFrame, or how to combine all the results.

Upvotes: 0

Views: 551

Answers (3)

Alper t. Turker
Alper t. Turker

Reputation: 35219

You can explode and pivot:

import org.apache.spark.sql.functions._

val cols = Seq("gender", "grade")

df
  .select(explode(array(cols map (c => concat(lit(c), col(c))): _*)))
  .groupBy().pivot("col").count.show

// +-------+-------+------+------+                                                 
// |genderF|genderM|grade2|grade3|
// +-------+-------+------+------+
// |      5|      2|     3|     4|
// +-------+-------+------+------+

Upvotes: 1

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

Use of when, sum and struct inbuilt functions should give you your desired result

import org.apache.spark.sql.functions._
dataDF.select(struct(sum(when(col("gender")==="M", 1)).as("numMales"), sum(when(col("gender")==="F", 1)).as("numFemales")).as("genderCounts"),
            struct(sum(when(col("grade")===2, 1)).as("numGrade2"), sum(when(col("grade")===3, 1)).as("numGrade3")).as("gradeCounts"))
    .select(col("genderCounts.*"), col("gradeCounts.*"))
  .show(false)

which should give you

+--------+----------+---------+---------+
|numMales|numFemales|numGrade2|numGrade3|
+--------+----------+---------+---------+
|2       |5         |3        |4        |
+--------+----------+---------+---------+

Upvotes: 2

9000
9000

Reputation: 40884

I'd say that you need to .groupBy().count() your dataframe separately by each column, them combine the answers into a new dataframe.

Upvotes: 0

Related Questions