gabi
gabi

Reputation: 1376

SparkSQL:Avg based on a column after GroupBy

I have an rdd of student grades and I need to first group them by the first column which is university and then show the average of student count in each course like this. What is the easiest way to do this query?

+----------+-------------------+                                                              
|university|  avg of students   |
+----------+--------------------+
|       MIT|    3               |
| Cambridge|    2.66  

Here is the dataset.

case class grade(university: String, courseId: Int, studentId: Int, grade: Double)

val grades = List(grade(
grade("Cambridge", 1, 1001, 4),
grade("Cambridge", 1, 1004, 4),
grade("Cambridge", 2, 1006, 3.5),
grade("Cambridge", 2, 1004, 3.5),
grade("Cambridge", 2, 1002, 3.5),
grade("Cambridge", 3, 1006, 3.5),
grade("Cambridge", 3, 1007, 5),
grade("Cambridge", 3, 1008, 4.5),
grade("MIT", 1, 1001, 4),
grade("MIT", 1, 1002, 4),
grade("MIT", 1, 1003, 4),
grade("MIT", 1, 1004, 4),
grade("MIT", 1, 1005, 3.5),
grade("MIT", 2, 1009, 2))

Upvotes: 0

Views: 662

Answers (2)

gabi
gabi

Reputation: 1376

gradesRdd.map({ case Grade(university: String, courseId: Int, studentId: Int, gpa: Int) => 
   ((university),(courseId))}).mapValues(x => (x, 1))
   .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
   .mapValues(y => 1.0 * y._1 / y._2).collect
   res73: Array[(String, Double)] = Array((Cambridge,2.125), (MIT,1.1666666666666667))

Upvotes: 0

Nagarjuna Pamu
Nagarjuna Pamu

Reputation: 14825

1) First groupBy university

2) then get course count per university

3) then groupBy courseId

4) then get student count per course

grades.groupBy(_.university).map { case (k, v) =>
    val courseCount = v.map(_.courseId).distinct.length
    val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
    k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
  }

Scala REPL

scala> val grades = List(
      grade("Cambridge", 1, 1001, 4),
      grade("Cambridge", 1, 1004, 4),
      grade("Cambridge", 2, 1006, 3.5),
      grade("Cambridge", 2, 1004, 3.5),
      grade("Cambridge", 2, 1002, 3.5),
      grade("Cambridge", 3, 1006, 3.5),
      grade("Cambridge", 3, 1007, 5),
      grade("Cambridge", 3, 1008, 4.5),
      grade("MIT", 1, 1001, 4),
      grade("MIT", 1, 1002, 4),
      grade("MIT", 1, 1003, 4),
      grade("MIT", 1, 1004, 4),
      grade("MIT", 1, 1005, 3.5),
      grade("MIT", 2, 1009, 2))
// grades: List[grade] = List(...)

scala> grades.groupBy(_.university).map { case (k, v) =>
      val courseCount = v.map(_.courseId).distinct.length
      val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
      k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
    }
// res2: Map[String, Double] = Map("MIT" -> 3.0, "Cambridge" -> 2.6666666666666665)

Upvotes: 2

Related Questions