Reputation: 677
How can I write the following SQL query using the Spark DataFrame API?
spark.sql("select Subject, (count(Student) * 100 / (select count(distinct Student) from StudentTable)) as PercentPassed from StudentTable where Marks > 35 group by Subject").show(false)
Here I want to get the percentage of students who passed (out of 100%) in each subject.
Below is the DataFrame registered as StudentTable:
import spark.implicits._

val data = Seq(
  ("S1", "English", 45),
  ("S1", "Hindi", 78),
  ("S1", "Maths", 67),
  ("S1", "Science", 23),
  ("S2", "English", 34),
  ("S2", "Hindi", 45),
  ("S2", "Maths", 88),
  ("S2", "Science", 65),
  ("S3", "English", 21),
  ("S3", "Hindi", 66),
  ("S3", "Maths", 63),
  ("S3", "Science", 44)
)
val df = spark.sparkContext.parallelize(data).toDF("Student", "Subject", "Marks")
df.createTempView("StudentTable")
Upvotes: 0
Views: 1453
Reputation: 7928
You can use a groupBy and then aggregate with a conditional count:
import org.apache.spark.sql.functions.{count, when}

df.groupBy($"Subject")
  .agg(
    count(when($"Marks" > 35, 1)) /
      count($"Marks") * 100 as "PercentPassed")
Explanation:
First, group by Subject.
Then, in the aggregate, compute two counts: count(when($"Marks" > 35, 1)) counts only the passing rows, because when without an otherwise yields null when the condition is false and count ignores nulls, while count($"Marks") counts all rows for the subject.
Finally, divide the first count by the second and multiply by 100 to get the percentage.
RESULT
+-------+-----------------+
|Subject| PercentPassed|
+-------+-----------------+
|Science|66.66666666666666|
|English|33.33333333333333|
| Hindi| 100.0|
| Maths| 100.0|
+-------+-----------------+
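The arithmetic behind the conditional count can be sanity-checked with plain Scala collections, without Spark. This is just a sketch mirroring the aggregation above on the question's sample data; names like percentPassed are my own:

```scala
// Sanity check of the per-subject pass percentage using plain Scala
// collections -- same sample data as in the question, no Spark needed.
val data = Seq(
  ("S1", "English", 45), ("S1", "Hindi", 78), ("S1", "Maths", 67), ("S1", "Science", 23),
  ("S2", "English", 34), ("S2", "Hindi", 45), ("S2", "Maths", 88), ("S2", "Science", 65),
  ("S3", "English", 21), ("S3", "Hindi", 66), ("S3", "Maths", 63), ("S3", "Science", 44))

val percentPassed: Map[String, Double] =
  data.groupBy(_._2).map { case (subject, rows) =>
    val passed = rows.count(_._3 > 35)         // conditional count (Marks > 35)
    val total  = rows.size                     // total count per subject
    subject -> passed.toDouble / total * 100   // same formula as the agg
  }

println(percentPassed)
```

It reproduces the table above: Hindi and Maths at 100.0, English at 33.33…, Science at 66.66….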
Upvotes: 1