mrn

Reputation: 21

Solving a sequential problem with Scala Apache Spark

In Scala Spark, given a dataset sorted by Rank with columns S and B, select the lowest-ranked pairs of S and B such that each S value and each B value appears at most once in the result.

Sample input:

|Rank|S  |B  |
|----|---|---|
|   1|S1 |B1 |
|   2|S2 |B1 |
|   3|S3 |B1 |
|   4|S1 |B2 |
|   5|S3 |B1 |
|   6|S2 |B2 |

Sample Output:

|Rank|S  |B  |
|----|---|---|
|   1|S1 |B1 |
|   6|S2 |B2 |

I understand how this can be solved sequentially; however, is it possible to solve it using Spark? If so, how?
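For reference, the sequential approach mentioned above can be sketched as a single greedy pass in plain Scala (no Spark): scan rows in rank order and keep a row only if neither its S nor its B value has been selected yet. The `greedy` helper name is my own.

```scala
// Greedy pass over rank-sorted rows: keep a row only if neither its
// S value nor its B value appears in an already-selected row.
def greedy(rows: List[(Int, String, String)]): List[(Int, String, String)] =
  rows.foldLeft(List.empty[(Int, String, String)]) {
    case (acc, (rank, s, b)) =>
      if (acc.exists { case (_, si, bi) => si == s || bi == b }) acc
      else (rank, s, b) :: acc
  }.reverse // accumulator is built head-first, so restore rank order

val rows = List(
  (1, "S1", "B1"), (2, "S2", "B1"), (3, "S3", "B1"),
  (4, "S1", "B2"), (5, "S3", "B1"), (6, "S2", "B2")
)

val selected = greedy(rows)
// selected == List((1, "S1", "B1"), (6, "S2", "B2"))
```

This reproduces the sample output above; the question is how to distribute it.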

Upvotes: 0

Views: 105

Answers (1)

pgrandjean

Reputation: 766

This is only a partial solution, but I think that if your partitions are made right, you could use mapPartitions to do the job per partition. Something like the following:

val rdd: RDD[(Int, String, String)] = ...
rdd.mapPartitions { it =>
  it.foldLeft(List.empty[(Int, String, String)]) {
    case (acc, (j, s, b)) =>
      // keep a row only if neither its S nor its B value has already
      // been selected (in rank order) within this partition; the check
      // must cover the whole accumulator, not just its most recent entry
      if (acc.exists { case (_, si, bi) => si == s || bi == b }) acc
      else (j, s, b) :: acc
  }.reverse.iterator // mapPartitions expects an Iterator; restore rank order
}
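To see why this is only partial, here is a plain-Scala simulation (my own sketch, no Spark needed) of the per-partition fold on the sample data split into two rank-contiguous partitions; it shows a row being lost that the expected output keeps:

```scala
// Per-partition greedy pass, as the mapPartitions body does per partition.
def greedy(rows: List[(Int, String, String)]): List[(Int, String, String)] =
  rows.foldLeft(List.empty[(Int, String, String)]) {
    case (acc, (rank, s, b)) =>
      if (acc.exists { case (_, si, bi) => si == s || bi == b }) acc
      else (rank, s, b) :: acc
  }.reverse

val part1 = List((1, "S1", "B1"), (2, "S2", "B1"), (3, "S3", "B1"))
val part2 = List((4, "S1", "B2"), (5, "S3", "B1"), (6, "S2", "B2"))

val survivors = greedy(part1) ++ greedy(part2)
// survivors == List((1,"S1","B1"), (4,"S1","B2"), (5,"S3","B1"))
// Row 6 (S2, B2) is already gone: it conflicted with row 4 inside its
// own partition, but row 4 itself loses to row 1 globally. A second
// greedy pass over the survivors therefore yields only (1, "S1", "B1"),
// not the expected pair (1, "S1", "B1") and (6, "S2", "B2").
val merged = greedy(survivors)
```

So a correct result depends on the partitioning keeping conflicting rows together, which is what makes this a partial solution.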

Upvotes: 1
