Reputation: 1702
Consider the schema I have in the dataframe below in scala.
root
|-- phonetic: string (nullable = true)
|-- sigID: long (nullable = true)
I am basically grouping by phonetic.
featuers.rdd.groupBy(x => x.apply(0))
which will give me an rdd below
(abc,([1],[2],[3]))
(def,([9],[8]))
How do I flatten this to get a cartesian of (key,([value-a,value-b]))
abc,1,2
abc,1,3
abc,2,3
def,9,8
....
Thanks
Upvotes: 0
Views: 184
Reputation: 13927
Incidentally, to answer the original question, you can unwind the grouped data like this:
df.rdd.groupBy(x => x.apply(0)).flatMap(t => {
val longs = t._2.toArray.map(r => r.getLong(1));
longs.flatMap(l => longs.flatMap(l2 => {
if (l != l2) Seq((t._1, l, l2));
else Seq()
}))
}).collect
res35: Array[(Any, Long, Long)] = Array((def,9,8), (def,8,9), (abc,1,2), (abc,1,3), (abc,2,1), (abc,2,3), (abc,3,1), (abc,3,2))
Upvotes: 1
Reputation: 13927
You can just leave it as a DataFrame
and do this:
val df: DataFrame = ...
df.as("df1").join(
df.as("df2"),
($"df2.phonetic" === $"df1.phonetic") && ($"df1.sigID" !== $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show
Upvotes: 1