Govinda

Reputation: 73

Sparksql to select certain records against 3 tables

I have 3 tables and need to fetch the records as below

Table_A,
Table_B,
Table_C

Select only the Table_A records whose ids appear in both Table_B and Table_C, and ignore records that are missing from either table. The final result should contain no duplicates.

Approach 1 tried: inner join Table_A with Table_B, then a separate inner join of Table_A with Table_C, and finally a union of the two results.

Ab = Table_A.join(Table_B,Table_A["id"] == Table_B["id"], "inner").select(common columns)

Ac = Table_A.join(Table_C,Table_A["id"] == Table_C["id"], "inner").select(common columns)

result = Ab.union(Ac)                   # <<Got more duplicates>>
result = result.dropDuplicates(["id"])

But even after dropDuplicates I still got duplicates.

Approach 2 tried with Spark SQL:

select a.*
from Table_A a
left outer join Table_B b
  on a.id = b.id
left outer join Table_C c
  on a.id = c.id

With this approach there were no duplicates, but the result had more records than Table_A and also included the uncommon records.

Any suggestion on the best approach would be appreciated.

Upvotes: 1

Views: 55

Answers (1)

GMB

Reputation: 222622

In Spark SQL, I would recommend exists:

select a.*
from table_a a
where exists (select 1 from table_b b where b.id = a.id)
  and exists (select 1 from table_c c where c.id = a.id)

This does the filtering you want, and will not duplicate records of table_a in the result set, even if there are multiple matches in table_b or table_c.

Upvotes: 2
