Krishna Kalyan

Reputation: 1702

Flatten an RDD to get non-repeating value pairs in Spark

Consider the schema of the DataFrame I have below, in Scala:

root
  |-- phonetic: string (nullable = true)
  |-- sigID: long (nullable = true)

I am basically grouping by phonetic:

features.rdd.groupBy(x => x.apply(0))

which gives me an RDD like the one below:

(abc,([1],[2],[3]))
(def,([9],[8]))
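
For reference, here is a minimal sketch of a toy DataFrame that matches this schema and reproduces the grouping above (the SparkSession name and sample values are assumptions for illustration; on Spark 1.x use import sqlContext.implicits._ instead):

import spark.implicits._

// toy data with the same shape: phonetic (string) and sigID (long)
val features = Seq(
  ("abc", 1L), ("abc", 2L), ("abc", 3L),
  ("def", 9L), ("def", 8L)
).toDF("phonetic", "sigID")

// group the full Rows by the first column (phonetic)
features.rdd.groupBy(x => x.apply(0)).collect.foreach(println)
// prints something like:
// (abc,CompactBuffer([abc,1], [abc,2], [abc,3]))
// (def,CompactBuffer([def,9], [def,8]))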

How do I flatten this to get a cartesian-style pairing of the values for each key, i.e. (key, value-a, value-b) with no repeated pairs, like so:

abc,1,2
abc,1,3
abc,2,3
def,9,8
....

Thanks

Upvotes: 0

Views: 184

Answers (2)

David Griffin

Reputation: 13927

Incidentally, to answer the original question, you can unwind the grouped data like this:

df.rdd.groupBy(x => x.apply(0)).flatMap(t => {
  // t._1 is the phonetic key, t._2 is the Iterable of grouped Rows
  val longs = t._2.toArray.map(r => r.getLong(1))
  // pair every sigID with every other sigID for the same key
  longs.flatMap(l => longs.flatMap(l2 => {
    if (l != l2) Seq((t._1, l, l2))
    else Seq()
  }))
}).collect

res35: Array[(Any, Long, Long)] = Array((def,9,8), (def,8,9), (abc,1,2), (abc,1,3), (abc,2,1), (abc,2,3), (abc,3,1), (abc,3,2))
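
If you only want each unordered pair once (as in the abc,1,2 / abc,1,3 / abc,2,3 output in the question), a small variant is to require l < l2. This is a sketch under that assumption, not part of the original answer:

df.rdd.groupBy(x => x.apply(0)).flatMap { t =>
  val longs = t._2.map(_.getLong(1)).toArray
  // keep each pair once, smaller sigID first
  for {
    l  <- longs
    l2 <- longs
    if l < l2
  } yield (t._1, l, l2)
}.collect

// e.g. Array((def,8,9), (abc,1,2), (abc,1,3), (abc,2,3))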

Upvotes: 1

David Griffin

Reputation: 13927

You can just leave it as a DataFrame and do this:

// the $-column syntax needs the implicits in scope,
// e.g. import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (2.x+)
val df: DataFrame = ...

// self-join on phonetic, dropping rows paired with themselves
// (!== is the Spark 1.x inequality operator; newer versions use =!=)
df.as("df1").join(
  df.as("df2"),
  ($"df2.phonetic" === $"df1.phonetic") && ($"df1.sigID" !== $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show
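
As with the RDD version, if you want each unordered pair only once, one option (a sketch, not from the original answer) is to replace the inequality with a strict ordering on sigID:

df.as("df1").join(
  df.as("df2"),
  ($"df2.phonetic" === $"df1.phonetic") && ($"df1.sigID" < $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show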

Upvotes: 1
