Spark repartitionAndSortWithinPartitions with tuples

Question

I'm trying to follow this example to partition HBase rows: https://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/

However, my data is already stored as (String, String, String) tuples, where the first element is the row key, the second is the column name, and the third is the column value.

I tried defining an implicit Ordering so that the OrderedRDDFunctions implicit conversion would apply:

implicit val caseInsensitiveOrdering: Ordering[(String, String, String)] = new Ordering[(String, String, String)] {
  override def compare(x: (String, String, String), y: (String, String, String)): Int = ???
}

but repartitionAndSortWithinPartitions is still not available. Is there a way I can use this method with this tuple?

pasha701 · Accepted Answer

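The RDD must contain key-value pairs, not just values. repartitionAndSortWithinPartitions is defined on OrderedRDDFunctions, which is only available for an RDD[(K, V)] whose key type K has an Ordering. For example: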

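import org.apache.spark.rdd.RDD

// sparkContext is assumed to be an existing SparkContext
val data = List((("5", "6", "1"), 1))
val rdd: RDD[((String, String, String), Int)] = sparkContext.parallelize(data)
// Compare the three fields in turn, ignoring case; the first non-zero result wins
implicit val caseInsensitiveOrdering: Ordering[(String, String, String)] = new Ordering[(String, String, String)] {
  override def compare(x: (String, String, String), y: (String, String, String)): Int =
    Seq(x._1.compareToIgnoreCase(y._1), x._2.compareToIgnoreCase(y._2), x._3.compareToIgnoreCase(y._3)).find(_ != 0).getOrElse(0)
}
rdd.repartitionAndSortWithinPartitions(..) // takes a Partitioner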
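
Applied to the data in the question, one possible approach is to reshape each (rowkey, column, value) triple into a ((rowkey, column), value) pair. The sketch below is only an illustration under assumptions: the object name TripleSortSketch, the helper toKeyed, the sample cells, and the local[2] master are all hypothetical, and HashPartitioner stands in for whatever partitioner the job actually needs. With a (String, String) key, Scala's built-in tuple Ordering is picked up automatically, so no custom Ordering is required unless case-insensitive comparison is wanted.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TripleSortSketch {
  // Hypothetical helper: turn (rowKey, column, value) into ((rowKey, column), value)
  // so that the RDD is a pair RDD and OrderedRDDFunctions applies.
  def toKeyed(cells: RDD[(String, String, String)]): RDD[((String, String), String)] =
    cells.map { case (rowKey, column, value) => ((rowKey, column), value) }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for trying the sketch outside a cluster.
    val conf = new SparkConf().setAppName("triple-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample cells in the question's (rowkey, column, value) layout.
      val cells = sc.parallelize(Seq(
        ("row2", "colB", "v4"),
        ("row1", "colB", "v2"),
        ("row2", "colA", "v3"),
        ("row1", "colA", "v1")
      ))

      // The key is (rowKey, column); the default tuple Ordering sorts by
      // row key first and column name second within each partition.
      val sorted = toKeyed(cells)
        .repartitionAndSortWithinPartitions(new HashPartitioner(2))

      // glom() exposes each partition as an array, to inspect the per-partition order.
      sorted.glom().collect().foreach(part => println(part.mkString(", ")))
    } finally {
      sc.stop()
    }
  }
}
```

Note that HashPartitioner hashes the whole (rowKey, column) key, so cells from the same row can land in different partitions. For an HBase bulk load, a custom Partitioner that looks only at the row key would likely be a better fit.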
