Reputation: 850
I have this dataset:
(apple,1)
(banana,4)
(orange,3)
(grape,2)
(watermelon,2)
and the other dataset is:
(apple,Map(Bob -> 1))
(banana,Map(Chris -> 1))
(orange,Map(John -> 1))
(grape,Map(Smith -> 1))
(watermelon,Map(Phil -> 1))
I am aiming to combine both datasets to get:
(apple,1,Map(Bob -> 1))
(banana,4,Map(Chris -> 1))
(orange,3,Map(John -> 1))
(grape,2,Map(Smith -> 1))
(watermelon,2,Map(Phil -> 1))
The code I have:
...
val counts_firstDataset = words.map(word =>
(word.firstWord, 1)).reduceByKey{case (x, y) => x + y}
Second dataset:
...
val counts_secondDataset = secondSet.map(x => (x._1,
x._2.toList.groupBy(identity).mapValues(_.size)))
I tried to use the join method:
val joined_data = counts_firstDataset.join(counts_secondDataset)
but it did not work, because join operates on pairs of [K, V]. How would I get around this issue?
Upvotes: 0
Views: 962
Reputation: 420
As the first elements (the fruit names) of both datasets are in the same order, you can combine the two lists of tuples with zip, and then use map to flatten each nested pair into a triple:
counts_firstDataset.zip(counts_secondDataset)
.map(vk => (vk._1._1, vk._1._2, vk._2._2))
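A minimal, self-contained sketch of this zip-then-map pattern using plain Scala collections in place of the RDDs (the fruit names and counts are taken from the question; the variable names are illustrative):

```scala
object ZipCombine extends App {
  // Stand-ins for counts_firstDataset and counts_secondDataset
  val countsFirst = List(("apple", 1), ("banana", 4), ("orange", 3))
  val countsSecond = List(
    ("apple", Map("Bob" -> 1)),
    ("banana", Map("Chris" -> 1)),
    ("orange", Map("John" -> 1))
  )

  // zip pairs elements positionally: ((fruit, count), (fruit, names)).
  // A pattern-matching map then flattens each nested pair into a triple.
  val combined = countsFirst.zip(countsSecond).map {
    case ((fruit, count), (_, names)) => (fruit, count, names)
  }

  combined.foreach(println)
  // (apple,1,Map(Bob -> 1))
  // (banana,4,Map(Chris -> 1))
  // (orange,3,Map(John -> 1))
}
```

Note that on real RDDs, zip additionally requires both RDDs to have the same number of partitions and the same number of elements per partition.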
Upvotes: 1
Reputation: 13154
The easiest way is just to convert to DataFrames and then join:
import spark.implicits._
val counts_firstDataset = words
.map(word => (word.firstWord, 1))
.reduceByKey{case (x, y) => x + y}
.toDF("type", "value")
val counts_secondDataset = secondSet
.map(x => (x._1,x._2.toList.groupBy(identity).mapValues(_.size)))
.toDF("type_2","map")
counts_firstDataset
.join(counts_secondDataset, 'type === 'type_2)
.drop('type_2)
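If you prefer to stay with pair RDDs, join does work here: it returns (K, (V1, V2)), and a follow-up map flattens that into the desired triple. A minimal sketch with plain Scala collections standing in for the RDDs (names and sample values assumed from the question):

```scala
object JoinFlatten extends App {
  // Stand-ins for the two keyed datasets
  val first = List("apple" -> 1, "banana" -> 4)
  val second = Map(
    "apple" -> Map("Bob" -> 1),
    "banana" -> Map("Chris" -> 1)
  )

  // Inner-join by key, then flatten (fruit, (count, names)) to a triple.
  // On RDDs: first.join(second).map { case (k, (v1, v2)) => (k, v1, v2) }
  val joined = first.flatMap { case (fruit, count) =>
    second.get(fruit).map(names => (fruit, count, names))
  }

  joined.foreach(println)
  // (apple,1,Map(Bob -> 1))
  // (banana,4,Map(Chris -> 1))
}
```

Unlike the zip approach, a join does not depend on the two datasets being in the same order or having matching partitioning.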
Upvotes: 1