Krishna Kalyan

Reputation: 1702

Flatten an RDD to get non-repeating value pairs in Spark

Consider the schema of the DataFrame I have below, in Scala:

root
  |-- phonetic: string (nullable = true)
  |-- sigID: long (nullable = true)

I am basically grouping by phonetic:

features.rdd.groupBy(x => x.apply(0))

which gives me an RDD like the one below:

(abc,([1],[2],[3]))
(def,([9],[8]))
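
For reference, here is a minimal sketch of a toy DataFrame that matches this schema and reproduces the grouping above (the SparkSession name and sample values are assumptions for illustration; on Spark 1.x use import sqlContext.implicits._ instead):

import spark.implicits._

// toy data with the same shape: phonetic (string) and sigID (long)
val features = Seq(
  ("abc", 1L), ("abc", 2L), ("abc", 3L),
  ("def", 9L), ("def", 8L)
).toDF("phonetic", "sigID")

// group the full Rows by the first column (phonetic)
features.rdd.groupBy(x => x.apply(0)).collect.foreach(println)
// prints something like:
// (abc,CompactBuffer([abc,1], [abc,2], [abc,3]))
// (def,CompactBuffer([def,9], [def,8]))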

How do I flatten this to get a cartesian-style pairing of the values for each key, i.e. (key, value-a, value-b) with no repeated pairs, like so:

abc,1,2
abc,1,3
abc,2,3
def,9,8
....

Thanks

Upvotes: 0

Views: 184

Answers (2)

David Griffin

Reputation: 13927

Incidentally, to answer the original question, you can unwind the grouped data like this:

df.rdd.groupBy(x => x.apply(0)).flatMap(t => {
  // t._1 is the phonetic key, t._2 is the Iterable of grouped Rows
  val longs = t._2.toArray.map(r => r.getLong(1))
  // pair every sigID with every other sigID for the same key
  longs.flatMap(l => longs.flatMap(l2 => {
    if (l != l2) Seq((t._1, l, l2))
    else Seq()
  }))
}).collect

res35: Array[(Any, Long, Long)] = Array((def,9,8), (def,8,9), (abc,1,2), (abc,1,3), (abc,2,1), (abc,2,3), (abc,3,1), (abc,3,2))
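
If you only want each unordered pair once (as in the abc,1,2 / abc,1,3 / abc,2,3 output in the question), a small variant is to require l < l2. This is a sketch under that assumption, not part of the original answer:

df.rdd.groupBy(x => x.apply(0)).flatMap { t =>
  val longs = t._2.map(_.getLong(1)).toArray
  // keep each pair once, smaller sigID first
  for {
    l  <- longs
    l2 <- longs
    if l < l2
  } yield (t._1, l, l2)
}.collect

// e.g. Array((def,8,9), (abc,1,2), (abc,1,3), (abc,2,3))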

Upvotes: 1

David Griffin

Reputation: 13927

You can just leave it as a DataFrame and do this:

// the $-column syntax needs the implicits in scope,
// e.g. import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (2.x+)
val df: DataFrame = ...

// self-join on phonetic, dropping rows paired with themselves
// (!== is the Spark 1.x inequality operator; newer versions use =!=)
df.as("df1").join(
  df.as("df2"),
  ($"df2.phonetic" === $"df1.phonetic") && ($"df1.sigID" !== $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show
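
As with the RDD version, if you want each unordered pair only once, one option (a sketch, not from the original answer) is to replace the inequality with a strict ordering on sigID:

df.as("df1").join(
  df.as("df2"),
  ($"df2.phonetic" === $"df1.phonetic") && ($"df1.sigID" < $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show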

Upvotes: 1
