Reputation: 316
I have two datasets, as shown below.
dataset1
+--------------------+----------------+
|ids                 | names          |
+--------------------+----------------+
|1236015             | aaaaa          |
|234567              | bbbbb          |
|90909090            | ccccc          |
+--------------------+----------------+
and this is the schema for dataset1:
root
|-- ids: string (nullable = true)
|-- names: string (nullable = true)
dataset2
+--------------------+
|ids                 |
+--------------------+
|1236015             |
|90909090            |
|1345677             |
+--------------------+
and this is the schema for dataset2:
root
|-- ids: string (nullable = true)
I want to remove the rows from dataset1 whose ids are present in dataset2, like this:
+--------------------+----------------+
|ids                 | names          |
+--------------------+----------------+
|234567              | bbbbb          |
+--------------------+----------------+
I tried the following:
dataset1.join(dataset2,col("ids").notEqual(col("ids")), "semi");
...but it returns all rows from dataset1. What could be the issue?
Upvotes: 0
Views: 839
Reputation: 9417
As the Spark documentation says:
Anti Join
An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.
So in your case you probably want:
dataset1.join(dataset2,dataset1.col("ids").equalTo(dataset2.col("ids")), "leftanti");
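As for why the original attempt keeps every row: a semi join keeps a left row as soon as at least one right row satisfies the join condition. With a not-equal condition, almost every left id differs from some right id, so nothing is filtered out. Here is a minimal plain-Java sketch of the two semantics, with no Spark involved. The class and method names (JoinSemantics, semiNotEqual, antiEqual) are made up for illustration, and it assumes the condition compares the left ids against the right ids:

```java
import java.util.List;
import java.util.stream.Collectors;

public class JoinSemantics {

    // Semi join with a not-equal condition:
    // keep a left id if ANY right id differs from it.
    static List<String> semiNotEqual(List<String> left, List<String> right) {
        return left.stream()
                .filter(l -> right.stream().anyMatch(r -> !l.equals(r)))
                .collect(Collectors.toList());
    }

    // Left anti join with an equality condition:
    // keep a left id only if NO right id equals it.
    static List<String> antiEqual(List<String> left, List<String> right) {
        return left.stream()
                .filter(l -> right.stream().noneMatch(l::equals))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The ids from the question's dataset1 and dataset2.
        List<String> ids1 = List.of("1236015", "234567", "90909090");
        List<String> ids2 = List.of("1236015", "90909090", "1345677");

        System.out.println(semiNotEqual(ids1, ids2)); // [1236015, 234567, 90909090]
        System.out.println(antiEqual(ids1, ids2));    // [234567]
    }
}
```

With the sample data, the not-equal semi join returns all three ids, while the anti join on equality leaves only 234567, because 1236015 and 90909090 both appear in dataset2.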
Upvotes: 1