lapolonio
lapolonio

Reputation: 1147

How to retrieve record with min value in spark?

Lets say i have a RDD like this -> (String, Date, Int)

[("sam", 02-25-2016, 2), ("sam",02-14-2016, 4), ("pam",03-16-2016, 1), ("pam",02-16-2016, 5)]

and i want to convert it into a list like this ->

[("sam", 02-14-2016, 4), ("pam",02-16-2016, 5)]

where value is record where date is min for each key. What is the best way to do this?

Upvotes: 3

Views: 4227

Answers (1)

Angelo Genovese
Angelo Genovese

Reputation: 3398

I assume that since you tagged the question as being related to spark you mean an RDD as opposed to a list.

making the record into a 2 tuple, with the key as the first element will allow you to use the reduceByKey method, something like this:

rdd
  .map(t => (t._1, (t._2, t._3))
  .reduceByKey((a, b) => if (a._1 < b._1) a else b)
  .map(t => (t._1, t._2._1, t._2._2))

Alternatively, using pattern matching for clarity: (I always find the _* accessors for tuples a bit confusing to read)

rdd
  .map {case (name, date, value) => (name, (date, value))}
  .reduceByKey((a, b) => (a, b) match {
     case ((aDate, aVal), (bDate, bVal)) => 
       if (aDate < bDate) a else b
  })
  .map {case (name, (date, value)) => (name, date, value)}

replace the a._1 < b._1 with whatever comparison is appropriate for the date type you are working with.

see http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs for documentation on reduceByKey, and other things you can do with key/value pairs in spark

If you were actually looking to do this with a plain old scala List, the following would work:

list
  .groupBy(_._1)
  .mapValues(l => l.reduce((a, b) => if(a._2 < b._2) a else b))
  .values
  .toList

And again a pattern matched version for clarity:

list
  .groupBy {case (name, date, value) => name}
  .mapValues(l => l.reduce((a, b) => (a,b) match {
    case ((aName, aDate, aValue), (bName, bDate, bValue)) => 
      if(aDate < bDate) a else b
  }))
  .values
  .toList

Upvotes: 5

Related Questions