Phil Baines
Phil Baines

Reputation: 467

Joining two DataFrames and appending where not exists

I have two DataFrames. One is a MasterList, the other is an InsertList

MasterList:

+--------+--------+
|  ttm_id|audit_id|
+--------+--------+
|       1|      10|
|      15|      10|
+--------+--------+

InsertList:

+--------+--------+
|  ttm_id|audit_id|
+--------+--------+
|       1|      10|
|      15|       9|
+--------+--------+

In Scala, how do I join two DataFrames but only append to the new DataFrame records

WHERE MasterList.ttm_id = InsertList.ttm_id AND
      MasterList.audit_id != InsertList.audit_id

-

ExpectedOutput:

+--------+--------+
|  ttm_id|audit_id|
+--------+--------+
|       1|      10|
|      15|      10|
|      15|       9|
+--------+--------+

Upvotes: 0

Views: 1887

Answers (2)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

It seems that you want to merge rows from insertList dataFrame that are not in masterList dataFrame. This can be achived using except function

insertList.except(masterList)

And you just use union function merge both dataFrames as

masterList.union(insertList.except(masterList))

You should get what you desire as

+------+--------+
|ttm_id|audit_id|
+------+--------+
|1     |10      |
|15    |10      |
|15    |9       |
+------+--------+

Upvotes: 2

Alper t. Turker
Alper t. Turker

Reputation: 35219

I'd anti join (NOT IN) by both columns and union

val masterList = Seq((1, 10), (15, 10)).toDF("ttm_id", "audit_id")
val insertList = Seq((1, 10), (15, 9)).toDF("ttm_id", "audit_id")

insertList
    .join(masterList, Seq("ttm_id", "audit_id"), "leftanti")
    .union(masterList)
    .show
// +------+--------+
// |ttm_id|audit_id|
// +------+--------+
// |    15|       9|
// |     1|      10|
// |    15|      10|
// +------+--------+

Upvotes: 2

Related Questions