Kepler

Reputation: 429

How to pick the earliest timestamp record from an RDD in Scala

I have an RDD of the shape ((String, String), Timestamp). I have a large number of records and I want to select, for each key, the record with the earliest Timestamp value. I have tried the following code and am still struggling to do this. Can anybody help me?

The code below is what I tried; it is wrong and does not work:

val context = sparkSession.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", url)
  .option("dbtable", "student_risk")
  .option("user", "user")
  .option("password", "password")
  .load()
context.cache();

val studentRDD = context.rdd.map(r => ((r.getString(r.fieldIndex("course_id")), r.getString(r.fieldIndex("student_id"))), r.getTimestamp(r.fieldIndex("risk_date_time"))))
val filteredRDD = studentRDD.collect().map(z => (z._1, z._2)).reduce((x, y) => (x._2.compareTo(y._2)))

Upvotes: 3

Views: 2536

Answers (2)

Assaf Mendelson

Reputation: 13001

First, your code produces incorrect results because the reduce is wrong: the function passed to reduce returns an Int (the result of compareTo) instead of the (key, timestamp) pair, and an Int has no ._2 member, so it does not even type-check. To correct this, try:

  studentRDD.collect().map(z => (z._1, z._2)).reduce((x, y) => if (x._2.compareTo(y._2) < 0) x else y)._1

Basically, this new function keeps whichever record has the smaller time, and on the overall result (the earliest record) ._1 extracts the key.
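
As a minimal illustration of how this comparator behaves, here is a sketch on hypothetical in-memory data (the keys and timestamps below are made up):

import java.sql.Timestamp

// Hypothetical records in the same ((course_id, student_id), risk_date_time) shape
val sample = Seq(
  (("c1", "s1"), Timestamp.valueOf("2017-01-03 10:00:00")),
  (("c2", "s2"), Timestamp.valueOf("2017-01-01 10:00:00")),
  (("c1", "s2"), Timestamp.valueOf("2017-01-02 10:00:00")))

// Keep whichever pair has the smaller timestamp; ._1 then extracts its key
val earliestKey = sample.reduce((x, y) => if (x._2.compareTo(y._2) < 0) x else y)._1
// earliestKey == ("c2", "s2")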

Note that you are doing all of this on the driver because of the collect. There is no reason to collect: map and reduce also work directly on RDDs, so you can get the same result (and stay scalable) by doing this:

studentRDD.map(z => (z._1, z._2)).reduce((x, y) => if (x._2.compareTo(y._2) < 0) x else y)._1

You can do this directly from your context dataframe though:

val targetRow = context
  .agg(min(struct('risk_date_time, 'course_id, 'student_id)) as "rec")
  .select($"rec.*")
  .collect()(0)
val key = (targetRow.getString(1), targetRow.getString(2))
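
Note that this snippet assumes min and struct from org.apache.spark.sql.functions, plus the session implicits for the 'symbol and $"..." column syntax:

import org.apache.spark.sql.functions.{min, struct}
import sparkSession.implicits._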

Upvotes: 3

Tzach Zohar

Reputation: 37852

It's easy to do this directly on the DataFrame (oddly named context here):

val result = context
  .groupBy("course_id", "student_id")
  .agg(min("risk_date_time") as "risk_date_time")

Then you can convert it into an RDD (if needed) as you did before; the result has the same schema.
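
A sketch of that conversion, assuming you want the same ((course_id, student_id), risk_date_time) shape as the original studentRDD:

val resultRDD = result.rdd.map(r => (
  (r.getString(r.fieldIndex("course_id")), r.getString(r.fieldIndex("student_id"))),
  r.getTimestamp(r.fieldIndex("risk_date_time"))))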

If you DO want to perform this over the RDD, use reduceByKey:

studentRDD.reduceByKey((t1, t2) => if (t1.before(t2)) t1 else t2)
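
For example, with the studentRDD from the question, collecting the per-key minima could look like this (the printed output is just illustrative):

val earliestPerKey = studentRDD.reduceByKey((t1, t2) => if (t1.before(t2)) t1 else t2)
// One ((course_id, student_id), earliest risk_date_time) pair per key
earliestPerKey.collect().foreach { case ((course, student), ts) =>
  println(s"$course / $student -> $ts")
}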

Upvotes: 7
