Reputation: 123
I am trying to find an optimised way to generate a list of unique co-location pairings. I have looked to do this using a series of flatmaps and distinct queries but I have found the flatmap to be not overly performant when running over millions of records. Any help in optimising this would be gratefully received.
The dataset is (geohash, id) and I am running this on 30 Node Cluster.
val rdd = sc.parallelize(Seq(("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"),("gh5", "id3"))
val uniquePairings = rdd.groupByKey().map(value =>
value._2.toList.sorted.combinations(2).map{
case Seq(x, y) => (x, y)}.filter(id =>
id._1 != id._2)).flatMap(x => x).distinct()
voutput = Array(("id1","id2"),("id1","id3"),("id2","id3"))
Upvotes: 0
Views: 100
Reputation: 330383
A simple join
should be more than enough here. For example with DataFrames
:
val df = rdd.toDF
df.as("df1").join(df.as("df2"),
($"df1._1" === $"df2._1") &&
($"df1._2" < $"df2._2")
).select($"df1._2", $"df2._2")
or datasets
val ds = rdd.toDS
ds.as("ds1").joinWith(ds.as("ds2"),
($"ds1._1" === $"ds2._1") &&
($"ds1._2" < $"ds2._2")
).map{ case ((_, x), (_, y)) => (x, y)}
Upvotes: 1
Reputation: 11593
Look into the cartesian function. It produces an RDD that is all possible combinations of the input RDDs. Do note that this is an expensive operation (N^2 in the size of the RDD)
Upvotes: 0