chyoo CHENG

Reputation: 720

How do I exclude some data from an RDD in Scala and Spark?

I am new to Scala and Spark. I have an RDD whose data looks like this:

[
 (key1, compactbuffer(item1, item2, item3)),
 (key2, compactbuffer(item3, item4))
 .....
]

The other RDD is:

[item1, item2, item3, item4, item5, item6] // it is ordered

I want to get a result like this:

[
   (key1, compactbuffer(item4, item5, item6)),
   (key2, compactbuffer(item1, item2, item5, item6))
]

How do I achieve this?

Upvotes: 0

Views: 574

Answers (1)

David Griffin

Reputation: 13927

Assuming the two RDDs were named grouped and master, this should do it:

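// Pair every (key, items) group with every master item, keep the pairs whose
// master item is missing from the group, then regroup by key.
// Assumes the buffered items are plain values, as shown in the question.
grouped.cartesian(master)
  .filter { case ((_, items), candidate) => !items.exists(_ == candidate) }
  .map { case ((key, _), candidate) => (key, candidate) }
  .groupByKey()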
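
Note that the per-key step is just an ordered set difference: keep each master item that the group does not already contain. Here is a minimal sketch of that step on plain Scala collections, so it runs without Spark. The names `ComplementSketch` and `complement` are made up for illustration and are not part of any Spark API.

```scala
object ComplementSketch {
  // Hypothetical helper: returns the elements of `ordered` that do not occur
  // in `present`, keeping the order of `ordered`.
  def complement[A](ordered: Seq[A], present: Iterable[A]): Seq[A] = {
    val seen = present.toSet   // constant-time membership checks
    ordered.filterNot(seen)    // a Set[A] can be used as a predicate A => Boolean
  }

  def main(args: Array[String]): Unit = {
    // Sample data copied from the question.
    val master = Seq("item1", "item2", "item3", "item4", "item5", "item6")
    val grouped = Seq(
      "key1" -> Seq("item1", "item2", "item3"),
      "key2" -> Seq("item3", "item4")
    )

    // For each key, keep the master items that are missing from its group.
    val result = grouped.map { case (key, items) => key -> complement(master, items) }
    result.foreach(println)
  }
}
```

If master is small enough to collect to the driver (an assumption, the question does not say), the same function could be applied on the RDD without a cartesian product: broadcast the collected list with something like sc.broadcast(master.collect().toSeq), where sc is your SparkContext, and then call grouped.mapValues(items => ComplementSketch.complement(masterBc.value, items)), where masterBc is the broadcast variable. Unlike regrouping after cartesian, this keeps the items in master's order.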

Upvotes: 1
