Spark task optimisation

Question

I am trying to find an optimised way to generate a list of unique co-location pairings. I have tried a chain of flatMap and distinct calls, but the flatMap performs poorly when running over millions of records. Any help in optimising this would be gratefully received.

The dataset consists of (geohash, id) pairs and I am running this on a 30-node cluster.

val rdd = sc.parallelize(Seq(("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"), ("gh5", "id3")))

val uniquePairings = rdd.groupByKey().flatMap { case (_, ids) =>
  ids.toList.sorted.combinations(2)
    .map { case Seq(x, y) => (x, y) }
    .filter { case (x, y) => x != y }
}.distinct()

output = Array(("id1","id2"),("id1","id3"),("id2","id3"))

zero323 · Accepted Answer

A simple join should be more than enough here. For example with DataFrames:

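// Outside spark-shell, bring toDF and the $ syntax into scope first:
// import spark.implicits._
val df = rdd.toDF
df.as("df1").join(df.as("df2"),
  ($"df1._1" === $"df2._1") &&  // same geohash
  ($"df1._2" < $"df2._2")       // keep each unordered pair once, drop self-pairs
).select($"df1._2", $"df2._2")
// Append .distinct() if two ids can share more than one geohash.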

or with Datasets:

val ds = rdd.toDS
ds.as("ds1").joinWith(ds.as("ds2"),
  ($"ds1._1" === $"ds2._1") && 
  ($"ds1._2" < $"ds2._2")
).map{ case ((_, x), (_, y)) => (x, y)}
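
As a rough end-to-end sketch of the Dataset variant (not part of the original answer), the snippet below wraps the same self-join in a function and adds a final distinct(), since the question asks for unique pairings and a pair of ids sharing several geohashes would otherwise appear once per shared geohash. The object name UniquePairingsSketch, the function name uniquePairs, the local[*] master and the extra ("gh4", "id2") row are illustrative assumptions. It also assumes the input Dataset keeps the default tuple column names _1 and _2.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object UniquePairingsSketch {

  // Hypothetical helper: returns each unordered pair of ids that share at
  // least one geohash, exactly once. Assumes columns are named _1 (geohash)
  // and _2 (id), as produced by Seq(...).toDS() or rdd.toDS.
  def uniquePairs(ds: Dataset[(String, String)]): Dataset[(String, String)] = {
    import ds.sparkSession.implicits._

    ds.as("a")
      .joinWith(ds.as("b"),
        ($"a._1" === $"b._1") &&  // same geohash
        ($"a._2" < $"b._2"))      // order ids, which also removes self-pairs
      .map { case ((_, x), (_, y)) => (x, y) }
      .distinct()                 // one row per pair across all geohashes
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster the master is
    // normally supplied by spark-submit.
    val spark = SparkSession.builder()
      .appName("unique-pairings-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question, plus ("gh4", "id2") so that id1 and id2
    // co-occur in two geohashes and distinct() has something to remove.
    val ds = Seq(
      ("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"),
      ("gh5", "id3"), ("gh4", "id2")
    ).toDS()

    uniquePairs(ds).collect().sorted.foreach(println)

    spark.stop()
  }
}
```

On this sample the result should contain the three pairs from the question, (id1,id2), (id1,id3) and (id2,id3), each once. Whether the join actually outperforms the groupByKey version on a given cluster depends on the data, in particular on how many ids fall into the largest geohash, so it is worth measuring on a realistic sample.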
