Sweety

Reputation: 316

Spark 3.1 joining two datasets in Java

I have two datasets, shown below.

dataset1
+--------------------+----------------+
|ids                 | names          |
+--------------------+----------------+
|1236015             | aaaaa          |
|234567              | bbbbb          |
|90909090            | ccccc          |
+--------------------+----------------+

This is the schema for dataset1:

root
 |-- ids: string (nullable = true)
 |-- names: string (nullable = true)


dataset2
+--------------------+
|ids                 |
+--------------------+
|1236015             |
|90909090            |
|1345677             |
+--------------------+
This is the schema for dataset2:
root
 |-- ids: string (nullable = true)

I want to remove the rows from dataset1 whose ids are present in dataset2, like this:

+--------------------+----------------+
|ids                 | names          |
+--------------------+----------------+
|234567              | bbbbb          |
+--------------------+----------------+

I tried the following:

dataset1.join(dataset2, col("ids").notEqual(col("ids")), "semi");

...but it returns all rows from dataset1. What could be the issue?

Upvotes: 0

Views: 839

Answers (1)

mazaneicha

Reputation: 9417

As the Spark documentation says:

Anti Join
An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.

So in your case you probably want a left anti join on equal ids:

dataset1.join(dataset2, dataset1.col("ids").equalTo(dataset2.col("ids")), "leftanti");
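To see why the "semi" attempt with a not-equal condition keeps every row, it may help to model the two join types in plain Java. This is only a sketch of the semantics using java.util collections, not Spark code: the class and method names (JoinSemanticsSketch, leftSemiNotEqual, leftAnti) are made up for illustration, only the ids column is modeled, and null ids are ignored. A left semi join keeps a left row as soon as the condition holds for at least one right row, and with more than one distinct id on the right there is always some right id that differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the join semantics; hypothetical helper, NOT the Spark API.
class JoinSemanticsSketch {

    // Left semi join on "left.id != right.id":
    // keep a left id if at least one right id differs from it.
    static List<String> leftSemiNotEqual(List<String> left, List<String> right) {
        List<String> kept = new ArrayList<>();
        for (String l : left) {
            boolean anyDifferent = false;
            for (String r : right) {
                if (!l.equals(r)) {
                    anyDifferent = true;
                    break;
                }
            }
            if (anyDifferent) {
                kept.add(l);
            }
        }
        return kept;
    }

    // Left anti join on "left.id == right.id":
    // keep a left id only if no right id equals it.
    static List<String> leftAnti(List<String> left, List<String> right) {
        Set<String> rightIds = new HashSet<>(right);
        List<String> kept = new ArrayList<>();
        for (String l : left) {
            if (!rightIds.contains(l)) {
                kept.add(l);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The ids from the question's dataset1 and dataset2.
        List<String> dataset1Ids = Arrays.asList("1236015", "234567", "90909090");
        List<String> dataset2Ids = Arrays.asList("1236015", "90909090", "1345677");

        // Every left id has some different id on the right, so nothing is dropped.
        System.out.println(leftSemiNotEqual(dataset1Ids, dataset2Ids)); // [1236015, 234567, 90909090]

        // Only the id that is absent from dataset2 survives.
        System.out.println(leftAnti(dataset1Ids, dataset2Ids)); // [234567]
    }
}
```

By the stated rule (drop ids that are present in dataset2), the anti join leaves only 234567 for the sample data.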

Upvotes: 1
