Reputation: 1878
I have two datasets,
Dataset1
+---+------+------+
|cid|itemId|bought|
+---+------+------+
|abc| 123| true|
|abc| 345| true|
|abc| 567| true|
|def| 123| true|
|def| 345| true|
|def| 567| true|
+---+------+------+
Dataset2
+---+------+------+
|cid|itemId|bought|
+---+------+------+
|abc| 123| false|
|abc| 345| false|
|def| 789| false|
+---+------+------+
My goal is to merge these two datasets based on following criteria.
CID and ItemId forms a combination (group key). Dataset2 will check the presence of group key in dataset1 if it is present while merging those records will be dropped irrespective of value in 3rd column.
For example in above example 1st and 2nd row of dataset2 will be dropped and only 3rd column will be merged.
Desired output dataset
+---+------+------+
|cid|itemId|bought|
+---+------+------+
|abc| 123| true|
|abc| 345| true|
|abc| 567| true|
|def| 123| true|
|def| 345| true|
|def| 567| true|
|def| 789| false|
+---+------+------+
This is the logic I am trying,
filter dataset2 based on cid and itemid value with intersection on dataset1 and store it as a tempDataset. But I couldn't get a filter which would work by that way. Still hunting for it.
Apply union on tempDataset and dataset1 and create dataset3.
I would like to know is there any simpler approach?
Also how to filter based to get tempDataset?
Upvotes: 0
Views: 67
Reputation: 22439
One approach would be to do a left_anti join
then union
as follows:
import org.apache.spark.sql.functions._
val df1 = Seq(
("abc", 123, true),
("abc", 345, true),
("abc", 567, true),
("def", 123, true),
("def", 345, true),
("def", 567, true)
).toDF("cid", "itemId", "bought")
val df2 = Seq(
("abc", 123, false),
("abc", 345, false),
("def", 789, false)
).toDF("cid", "itemId", "bought")
df1.union( df2.join(df1, Seq("cid", "itemId"), "left_anti") ).
show
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc| 123| true|
// |abc| 345| true|
// |abc| 567| true|
// |def| 123| true|
// |def| 345| true|
// |def| 567| true|
// |def| 789| false|
// +---+------+------+
Upvotes: 3