Rengasami Ramanujam
Rengasami Ramanujam

Reputation: 1878

Merge two spark datasets based on two coumn values excluding duplicates

I have two datasets,

Dataset1
+---+------+------+
|cid|itemId|bought|
+---+------+------+
|abc|   123|  true|
|abc|   345|  true|
|abc|   567|  true|
|def|   123|  true|
|def|   345|  true|
|def|   567|  true|
+---+------+------+

Dataset2
+---+------+------+
|cid|itemId|bought|
+---+------+------+
|abc|   123| false|
|abc|   345| false|
|def|   789| false|
+---+------+------+

My goal is to merge these two datasets based on following criteria.

CID and ItemId forms a combination (group key). Dataset2 will check the presence of group key in dataset1 if it is present while merging those records will be dropped irrespective of value in 3rd column.

For example in above example 1st and 2nd row of dataset2 will be dropped and only 3rd column will be merged.

Desired output dataset

+---+------+------+
|cid|itemId|bought|
+---+------+------+
|abc|   123|  true|
|abc|   345|  true|
|abc|   567|  true|
|def|   123|  true|
|def|   345|  true|
|def|   567|  true|
|def|   789| false|
+---+------+------+

This is the logic I am trying,

filter dataset2 based on cid and itemid value with intersection on dataset1 and store it as a tempDataset. But I couldn't get a filter which would work by that way. Still hunting for it.

Apply union on tempDataset and dataset1 and create dataset3.

I would like to know is there any simpler approach?

Also how to filter based to get tempDataset?

Upvotes: 0

Views: 67

Answers (1)

Leo C
Leo C

Reputation: 22439

One approach would be to do a left_anti join then union as follows:

import org.apache.spark.sql.functions._

val df1 = Seq(
  ("abc", 123, true),
  ("abc", 345, true),
  ("abc", 567, true),
  ("def", 123, true),
  ("def", 345, true),
  ("def", 567, true)
).toDF("cid", "itemId", "bought")

val df2 = Seq(
  ("abc", 123, false),
  ("abc", 345, false),
  ("def", 789, false)
).toDF("cid", "itemId", "bought")

df1.union( df2.join(df1, Seq("cid", "itemId"), "left_anti") ).
  show
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
// |def|   567|  true|
// |def|   789| false|
// +---+------+------+

Upvotes: 3

Related Questions