user2236600
user2236600

Reputation: 167

Scala RDD get earliest date by group

I have a case class RDD in Scala and need to find the earliest date by each group (patientID).

Here is the input:

patientID       date
000000047-01    2008-03-21T21:00:00Z
000000047-01    2007-10-24T19:45:00Z
000000485-01    2011-06-17T21:00:00Z
000000485-01    2006-02-22T18:45:00Z

The expected should be:

patientID       date
000000047-01    2007-10-24T19:45:00Z
000000485-01    2006-02-22T18:45:00Z

I tried something like following but didn't work

val out = medication.groupBy(x => x.patientID).sortBy(x => x.date).take(1)

Upvotes: 0

Views: 100

Answers (1)

Amit Prasad
Amit Prasad

Reputation: 725

Okay! So I understand your question correctly you want the top from every record, if that's the case then here I have created the solution.

 val dataDF = Seq(
            ("000000047-01",    "2008-03-21T21:00:00Z"),
            ("000000047-01" ,   "2007-10-24T19:45:00Z"),
            ("000000485-01",    "2011-06-17T21:00:00Z"),
            ("000000485-01",    "2006-02-22T18:45:00Z"))

  import spark.implicits._
  val dfWithSchema = dataDF.toDF("patientId", "date")
  val winSpec = Window.partitionBy("patientId").orderBy("date")

  val rank_df = dfWithSchema.withColumn("rank", rank().over(winSpec)).orderBy(col("patientId"))
   val result = rank_df.select(col("patientId"),col("date")).where(col("rank") === 1)
  result.show()

Please ignore the steps for creating the DF with the schema if you have already schema defined with your data.

Upvotes: 1

Related Questions