Reputation: 1147
Let's say I have an RDD of (String, Date, Int) like this:
[("sam", 02-25-2016, 2), ("sam", 02-14-2016, 4), ("pam", 03-16-2016, 1), ("pam", 02-16-2016, 5)]
and I want to convert it into a list like this:
[("sam", 02-14-2016, 4), ("pam", 02-16-2016, 5)]
that is, for each key, the record with the minimum date. What is the best way to do this?
Upvotes: 3
Views: 4227
Reputation: 3398
I assume that since you tagged the question as being related to Spark, you mean an RDD rather than a plain list.
Turning each record into a 2-tuple, with the name as the key, allows you to use the reduceByKey method, something like this:
rdd
  .map(t => (t._1, (t._2, t._3)))                   // key by name
  .reduceByKey((a, b) => if (a._1 < b._1) a else b) // keep the record with the smaller date
  .map(t => (t._1, t._2._1, t._2._2))               // flatten back to a 3-tuple
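For reference, here is a minimal runnable sketch of the above (my own example, not from the question). It assumes a local Spark context and stores the dates as ISO-formatted strings (yyyy-MM-dd), since the month-first strings in the question would not sort chronologically under plain string comparison:

import org.apache.spark.{SparkConf, SparkContext}

object MinDatePerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("min-date-per-key").setMaster("local[*]"))

    // ISO-formatted date strings, so lexicographic order matches chronological order
    val rdd = sc.parallelize(Seq(
      ("sam", "2016-02-25", 2), ("sam", "2016-02-14", 4),
      ("pam", "2016-03-16", 1), ("pam", "2016-02-16", 5)))

    val result = rdd
      .map(t => (t._1, (t._2, t._3)))
      .reduceByKey((a, b) => if (a._1 < b._1) a else b)
      .map(t => (t._1, t._2._1, t._2._2))
      .collect()

    result.foreach(println) // prints (sam,2016-02-14,4) and (pam,2016-02-16,5)
    sc.stop()
  }
}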
Alternatively, using pattern matching for clarity (I always find the ._1, ._2 accessors for tuples a bit confusing to read):
rdd
  .map { case (name, date, value) => (name, (date, value)) }
  .reduceByKey((a, b) => (a, b) match {
    case ((aDate, aVal), (bDate, bVal)) =>
      if (aDate < bDate) a else b
  })
  .map { case (name, (date, value)) => (name, date, value) }
Replace the a._1 < b._1 with whatever comparison is appropriate for the date type you are working with.
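For example, if your dates were java.time.LocalDate values (an assumption for illustration), they don't support < directly, but you can use isBefore:

import java.time.LocalDate

// Keep whichever (date, value) pair has the earlier date
def earlier(a: (LocalDate, Int), b: (LocalDate, Int)): (LocalDate, Int) =
  if (a._1.isBefore(b._1)) a else b

and pass earlier to reduceByKey in place of the inline comparison.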
See http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs for documentation on reduceByKey and the other things you can do with key/value pairs in Spark.
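As a side note, you can also factor the comparison out into an Ordering, which keeps the reduce function tidy. A sketch, assuming the same (date-string, value) pair shape as above:

// Order (date, value) pairs by their date component
val byDate = Ordering.by[(String, Int), String](_._1)

rdd
  .map(t => (t._1, (t._2, t._3)))
  .reduceByKey((a, b) => byDate.min(a, b)) // min picks the record with the earlier date
  .map(t => (t._1, t._2._1, t._2._2))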
If you were actually looking to do this with a plain old Scala List, the following would work:
list
  .groupBy(_._1)                                                 // group records by name
  .mapValues(l => l.reduce((a, b) => if (a._2 < b._2) a else b)) // keep the earliest date in each group
  .values
  .toList
And again, a pattern-matched version for clarity:
list
  .groupBy { case (name, date, value) => name }
  .mapValues(l => l.reduce((a, b) => (a, b) match {
    case ((aName, aDate, aValue), (bName, bDate, bValue)) =>
      if (aDate < bDate) a else b
  }))
  .values
  .toList
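For completeness, a quick sketch of running the List version on the sample data, again assuming ISO-formatted date strings so string comparison matches chronological order:

val list = List(
  ("sam", "2016-02-25", 2), ("sam", "2016-02-14", 4),
  ("pam", "2016-03-16", 1), ("pam", "2016-02-16", 5))

val result = list
  .groupBy(_._1)
  .mapValues(l => l.reduce((a, b) => if (a._2 < b._2) a else b))
  .values
  .toList

println(result) // List((sam,2016-02-14,4), (pam,2016-02-16,5)), though groupBy does not guarantee the outer order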
Upvotes: 5