Harshvardhan Solanki

Reputation: 677

Spark Count(Column-Name) in Select clause

How can I write the following SQL query using the Spark DataFrame API?

val dt = spark.sql("select Subject, (count(Student)*100 / (select count(DISTINCT Student) from StudentTable)) as PercentPassed from StudentTable where Marks > 35 GROUP BY Subject").show(false)

Here I want to get the percentage of students who passed (out of 100%) in each subject.

Below is the DataFrame registered as StudentTable:

val data = Seq(
    ("S1","English",45),
    ("S1","Hindi",78),
    ("S1","Maths",67),
    ("S1","Science",23),
    ("S2","English",34),
    ("S2","Hindi",45),
    ("S2","Maths",88),
    ("S2","Science",65),
    ("S3","English",21),
    ("S3","Hindi",66),
    ("S3","Maths",63),
    ("S3","Science",44)
  )
import spark.implicits._   // required for .toDF on the RDD of tuples

val df = spark.sparkContext.parallelize(data).toDF("Student","Subject","Marks")
df.createTempView("StudentTable")

Upvotes: 0

Views: 1453

Answers (1)

SCouto

Reputation: 7928

You can use a groupBy and then aggregate with a conditional count:

df.groupBy($"Subject")
  .agg(
        count(when($"Marks" > 35, 1)) / 
        count($"Marks") * 100 as "PercentPassed")

Explanation:
First, group by Subject.

Then, in the aggregation, generate two counts:

  • The first one counts only the rows where Marks is above 35 (passed); count ignores the nulls that when produces for the other rows
  • The second one counts every row (total)

Finally, divide the first by the second and multiply by 100 to get the percentage.
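An equivalent way to express the same ratio (a sketch, not part of the original answer) is to average a 0/1 pass indicator per subject:

import org.apache.spark.sql.functions.{avg, when}
import spark.implicits._

// The average of a 0/1 indicator is exactly passed-rows / total-rows per group
df.groupBy($"Subject")
  .agg(avg(when($"Marks" > 35, 1).otherwise(0)) * 100 as "PercentPassed")
  .show(false)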

RESULT

+-------+-----------------+
|Subject|    PercentPassed|
+-------+-----------------+
|Science|66.66666666666666|
|English|33.33333333333333|
|  Hindi|            100.0|
|  Maths|            100.0|
+-------+-----------------+
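If you want to mirror the original SQL more literally (dividing by the overall count of distinct students rather than by the per-subject row count), a hedged sketch along these lines should also work; for this data, where every student has a mark in every subject, it gives the same numbers:

import org.apache.spark.sql.functions.{count, countDistinct}
import spark.implicits._

// Scalar "subquery": the overall number of distinct students (3 for this data)
val totalStudents = df.select(countDistinct($"Student")).as[Long].head

// Per subject, count the passing rows and divide by the global student total
df.filter($"Marks" > 35)
  .groupBy($"Subject")
  .agg(count($"Student") * 100.0 / totalStudents as "PercentPassed")
  .show(false)

The two approaches only diverge if a student is missing a row for some subject: the groupBy/count version above divides by the rows actually present in that subject, while this version still divides by the global distinct-student count, as the SQL does.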

Upvotes: 1
