Spark reduceByKey: maintaining order

Question

My input dataset looks like

id1, 10, v1
id2, 9, v2
id2, 34, v3
id1, 6, v4
id1, 12, v5
id2, 2, v6

and I want the output

id1; 6,v4 | 10,v1 | 12,v5
id2; 2,v6 | 9,v2 | 34,v3

That is, each id should map to an array of (num, value) pairs:

id1: array[num(i),value(i)], sorted by num(i)

What I have tried:

Use the id and the 2nd column as the key, then sortByKey. Because the key is a string, it is sorted lexicographically rather than numerically.
Use the 2nd column as the key, sortByKey, then re-key by id with the 2nd column and the value as the value, and reduceByKey. In this case reduceByKey does not preserve the order, and groupByKey does not preserve it either. I understand this is expected.

Any help will be appreciated.

zero323 · Accepted Answer

Since you didn't provide any information about the input type, I assume it is RDD[(String, Int, String)]:

val rdd = sc.parallelize(
    ("id1", 10, "v1") :: ("id2", 9, "v2") ::
    ("id2", 34, "v3") :: ("id1", 6, "v4") :: 
    ("id1", 12, "v5") :: ("id2", 2, "v6") :: Nil)

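rdd
  .map { case (id, x, y) => (id, (x, y)) }      // key each record by id
  .groupByKey                                    // collect all (num, value) pairs per id
  .mapValues(iter => iter.toList.sortBy(_._1))   // sort each group numerically by num
  .sortByKey() // Optional if you want id1 before id2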

Edit:

To get the output you've described in the comments, you can replace the function passed to mapValues with something like this:

def process(iter: Iterable[(Int, String)]): String = {
  iter.toList
      .sortBy(_._1)
      .map{case (x, y) => s"$x,$y"}
      .mkString("|")
}
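
As a quick sanity check, the per-key logic can be exercised on plain Scala collections before passing it to mapValues. This is only a minimal sketch: the local `rows` list and the use of `groupBy` in place of Spark's groupByKey are stand-ins for illustration, and the separator is changed to " | " (an assumption, to match the spacing shown in the question). It is meant to run as a script or in the REPL.

```scala
// Same function as above, except the separator is " | " to match the question's output.
def process(iter: Iterable[(Int, String)]): String =
  iter.toList
    .sortBy(_._1)                       // numeric sort on the Int column
    .map { case (x, y) => s"$x,$y" }
    .mkString(" | ")

// Local stand-in for the RDD contents (no Spark needed for this check).
val rows = List(
  ("id1", 10, "v1"), ("id2", 9, "v2"), ("id2", 34, "v3"),
  ("id1", 6, "v4"), ("id1", 12, "v5"), ("id2", 2, "v6"))

// groupBy plays the role of groupByKey here; sortBy plays the role of sortByKey.
val result: List[(String, String)] = rows
  .groupBy(_._1)
  .map { case (id, group) => id -> process(group.map { case (_, x, y) => (x, y) }) }
  .toList
  .sortBy(_._1)

result.foreach { case (id, line) => println(s"$id; $line") }
// id1; 6,v4 | 10,v1 | 12,v5
// id2; 2,v6 | 9,v2 | 34,v3
```

In the Spark pipeline, the equivalent step is passing process to mapValues right after groupByKey.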
