Reputation: 21
In Scala Spark, given a dataset of (S, B) pairs sorted by rank, select the pairs with the lowest ranks such that each S value and each B value appears at most once.
Sample input:
|Rank|S |B |
|----|---|---|
| 1|S1 |B1 |
| 2|S2 |B1 |
| 3|S3 |B1 |
| 4|S1 |B2 |
| 5|S3 |B1 |
| 6|S2 |B2 |
Sample Output:
|Rank|S |B |
|----|---|---|
| 1|S1 |B1 |
| 6|S2 |B2 |
I understand how this can be solved sequentially; however, is it possible to solve it with Spark? If so, how?
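For context, the sequential greedy pass over the rank-sorted rows might look like this in plain Scala (the function name and tuple layout are illustrative, not from the question):

```scala
// Greedy selection over rows already sorted by rank: keep a row
// only if its S and B values have not been used by an earlier pick.
def selectPairs(rows: Seq[(Int, String, String)]): Seq[(Int, String, String)] = {
  val init = (Vector.empty[(Int, String, String)], Set.empty[String], Set.empty[String])
  val (selected, _, _) = rows.foldLeft(init) {
    case ((acc, seenS, seenB), row @ (_, s, b)) =>
      if (seenS(s) || seenB(b)) (acc, seenS, seenB) // conflict: skip this row
      else (acc :+ row, seenS + s, seenB + b)       // keep it, mark S and B as used
  }
  selected
}

// On the sample input this keeps (1, S1, B1) and (6, S2, B2).
```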
Upvotes: 0
Views: 105
Reputation: 766
This is only a partial solution, but if your data is partitioned appropriately, you could use mapPartitions to do the job within each partition. Something like the following:
val rdd: RDD[(Int, String, String)] = ???
rdd.mapPartitions { it =>
  // Keep a row only if neither its S nor its B value conflicts
  // with any row already selected in this partition.
  it.foldLeft(List.empty[(Int, String, String)]) {
    case (acc, e @ (_, cj1, cj2)) =>
      if (acc.exists { case (_, ci1, ci2) => ci1 == cj1 || ci2 == cj2 })
        acc
      else
        e :: acc
  }.reverse.iterator // restore rank order; mapPartitions must return an Iterator
}
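Since each greedy decision depends on all earlier selections, the per-partition survivors still have to be reconciled across partition boundaries. One option, assuming the survivors are few enough to collect to the driver, is to sort them by rank and repeat the same fold there. Note this can still differ from the fully sequential answer, because a row dropped inside a partition might have survived globally; that is part of what makes this a partial solution. A sketch (function and variable names are hypothetical):

```scala
// Hypothetical driver-side reconciliation: rerun the greedy pass
// over the rank-sorted survivors collected from all partitions.
def mergeSurvivors(survivors: Seq[(Int, String, String)]): List[(Int, String, String)] =
  survivors.sortBy(_._1).foldLeft(List.empty[(Int, String, String)]) {
    case (acc, e @ (_, s, b)) =>
      if (acc.exists { case (_, as, ab) => as == s || ab == b }) acc
      else e :: acc
  }.reverse

// Usage sketch (needs a SparkContext, so shown as a comment):
// val result = mergeSurvivors(perPartitionRdd.collect().toSeq)
```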
Upvotes: 1