Reputation: 35978
I have two tables: excluded and kaggleresults. I am trying to find records that exist in excluded
but don't exist in kaggleresults
counts:
scala> spark.sql("select * from excluded").count()
res136: Long = 4652
scala> spark.sql("select * from kaggleresults").count()
res137: Long = 4635
The difference is 17
scala> res136-res137
res139: Long = 17
I am trying to get those 17 records. I wrote the query below but it returns 38
instead.
scala> spark.sql("select * from excluded left join kaggleresults on kaggleresults.subject_id = excluded.subject_id where kaggleresults.subject_id is null").count()
res135: Long = 38
Question
What query do I need to write to get those 17 records?
Upvotes: 3
Views: 10346
Reputation: 74679
Isn't it a case for LEFT_ANTI join?
scala> val excluded = (0 to 5).toDS
left: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> val kaggleresults = (3 to 10).toDS
right: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> excluded.join(kaggleresults, Seq("value"), "leftanti").show
+-----+
|value|
+-----+
| 0|
| 1|
| 2|
+-----+
Upvotes: 9
Reputation: 46
For the actual records (not the counts), you can use
SELECT * FROM excluded
WHERE subject_id NOT IN ( SELECT subject_id FROM kaggleresults )
However, you should not be surprised if the resultant row count does not match the difference of the two tables' individual row counts.
For example: suppose table1 has id's 1, 2, 3, 4, and 5, and table2 has id's 3, 4, 5, and 6. The row counts of table1 and of table2 are 5 and 4, respectively, for a difference of 1, but there are actually two records in table1 that have id's that do not exist in table2.
Upvotes: 1
Reputation: 542
using not in key word You can use the below query
Select * from excluded where subject_id not in (select subject_id from kaggleresults)
Upvotes: 0
Reputation: 143
Use the Not In
syntax along with a subquery.
SELECT ID, Name
FROM Table1
WHERE ID NOT IN (SELECT ID FROM Table2)
SQL query to find record with ID not in another table
Upvotes: 0