COLD ICE

Reputation: 850

Joining pairs of key-value with pairs of key-map

I have this dataset:

(apple,1)
(banana,4)
(orange,3)
(grape,2)
(watermelon,2)

and the other dataset is:

(apple,Map(Bob -> 1))
(banana,Map(Chris -> 1))
(orange,Map(John -> 1))
(grape,Map(Smith -> 1))
(watermelon,Map(Phil -> 1))

I am aiming to combine both datasets to get:

(apple,1,Map(Bob -> 1))
(banana,4,Map(Chris -> 1))
(orange,3,Map(John -> 1))
(grape,2,Map(Smith -> 1))
(watermelon,2,Map(Phil -> 1))

The code I have:

...  
val counts_firstDataset = words
  .map(word => (word.firstWord, 1))
  .reduceByKey { case (x, y) => x + y }

Second dataset:

...
val counts_secondDataset = secondSet
  .map(x => (x._1, x._2.toList.groupBy(identity).mapValues(_.size)))

I tried to use the join method, val joined_data = counts_firstDataset.join(counts_secondDataset), but it did not work because join takes a pair of [K,V]. How would I get around this issue?

Upvotes: 0

Views: 962

Answers (2)

Shahbaz Shueb

Reputation: 420

As the first elements (the fruit names) of both lists are in the same order, you can combine the two lists of tuples using zip and then use map to turn each zipped pair into a three-element tuple, as follows:

counts_firstDataset.zip(counts_secondDataset)
  .map(vk => (vk._1._1, vk._1._2, vk._2._2))
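
Note that zip on RDDs only lines up correctly when both RDDs have the same number of partitions and the same number of elements in each partition, which is not guaranteed after a reduceByKey. If that cannot be relied on, a key-based join may be safer. The following is a minimal sketch, assuming both datasets are pair RDDs keyed by the fruit name; the object name JoinSketch, the helper joinFlat, and the sample data are hypothetical and only there for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinSketch {
  // Joins the two pair RDDs by key, then flattens the nested
  // (key, (count, names)) result into a (key, count, names) triple.
  def joinFlat(
      first: RDD[(String, Int)],
      second: RDD[(String, Map[String, Int])]
  ): RDD[(String, Int, Map[String, Int])] =
    first.join(second).map { case (fruit, (count, names)) =>
      (fruit, count, names)
    }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Sample data mirroring the question.
    val first = sc.parallelize(Seq(("apple", 1), ("banana", 4), ("orange", 3)))
    val second = sc.parallelize(Seq(
      ("apple", Map("Bob" -> 1)),
      ("banana", Map("Chris" -> 1)),
      ("orange", Map("John" -> 1))
    ))

    // Row order after a join is not guaranteed.
    joinFlat(first, second).collect().foreach(println)

    spark.stop()
  }
}
```

The join itself accepts any value types on either side; it returns pairs of the form (key, (leftValue, rightValue)), so the extra map step is what produces the flat triple asked for in the question.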

Upvotes: 1

Glennie Helles Sindholt

Reputation: 13154

The easiest way is just to convert to DataFrames and then join:

import spark.implicits._
val counts_firstDataset = words
  .map(word => (word.firstWord, 1))
  .reduceByKey{case (x, y) => x + y}
  .toDF("type", "value")

val counts_secondDataset = secondSet
  .map(x => (x._1,x._2.toList.groupBy(identity).mapValues(_.size)))
  .toDF("type_2","map")

counts_firstDataset
  .join(counts_secondDataset, 'type === 'type_2)
  .drop('type_2)
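
As a possible variation, if both DataFrames use the same name for the key column, the join overload that takes a sequence of column names keeps a single copy of that column, so no drop is needed. A minimal, self-contained sketch, assuming a local SparkSession and sample data in place of words and secondSet; the object name DataFrameJoinSketch and the helper joinOnType are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameJoinSketch {
  // Joins on the shared "type" column; the result keeps one "type" column.
  def joinOnType(first: DataFrame, second: DataFrame): DataFrame =
    first.join(second, Seq("type"))

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for the two counted datasets.
    val first = Seq(("apple", 1), ("banana", 4)).toDF("type", "value")
    val second = Seq(
      ("apple", Map("Bob" -> 1)),
      ("banana", Map("Chris" -> 1))
    ).toDF("type", "map")

    joinOnType(first, second).show(false)

    spark.stop()
  }
}
```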

Upvotes: 1
